Hello,
I'm examining my TopHat output and would like some opinions on my align_summary.txt.
For example, here is the summary for one RNA-seq alignment. The quality scores are good, and I trimmed the reads to a fixed length of 50 nt.
Left reads:
Input: 183588390
Mapped: 174579844 (95.1% of input)
of these: 36018754 (20.6%) have multiple alignments (48483 have >20)
Right reads:
Input: 183588390
Mapped: 174631890 (95.1% of input)
of these: 40873773 (23.4%) have multiple alignments (75889 have >20)
95.1% overall read alignment rate.
Aligned pairs: 168870512
of these: 24343928 (14.4%) have multiple alignments
and: 14517921 ( 8.6%) are discordant alignments
84.1% concordant pair alignment rate.
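As a sanity check, the percentages can be re-derived from the raw counts. The sketch below is only that: the helper name `pct` is made up for illustration, and the formula for the concordant rate (aligned pairs minus discordant pairs, divided by input pairs) is an assumption that happens to reproduce the reported 84.1%.

```python
# Counts copied from the align_summary.txt shown above.
left_input, left_mapped, left_multi = 183588390, 174579844, 36018754
right_input, right_mapped, right_multi = 183588390, 174631890, 40873773
aligned_pairs, pairs_multi, pairs_discordant = 168870512, 24343928, 14517921


def pct(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# Per-mate mapping rates and multi-mapping rates (relative to mapped reads).
left_rate = pct(left_mapped, left_input)
right_rate = pct(right_mapped, right_input)
left_multi_rate = pct(left_multi, left_mapped)
right_multi_rate = pct(right_multi, right_mapped)

# Overall rate pools both mates.
overall_rate = pct(left_mapped + right_mapped, left_input + right_input)

# Assumed definition: concordant pairs = aligned pairs - discordant pairs,
# expressed as a fraction of the input pairs.
concordant_rate = pct(aligned_pairs - pairs_discordant, left_input)

print(left_rate, right_rate, overall_rate)
print(left_multi_rate, right_multi_rate, concordant_rate)
```

Note that the multi-mapping percentages are relative to the mapped reads, not to the input, so about 20 to 23% of the mapped reads on each side have more than one reported alignment.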
I'm happy with a 95% alignment rate, but I'm not sure where the multiple alignments come from, nor whether a rate of 20.6% (left) and 23.4% (right) of mapped reads is acceptable.
The raw data come from an RNA-seq experiment, so I would not expect the reads to align to repeated sequences. I suppose the multiple alignments come from mismatches.
I can always discard these reads, but I'd like to know whether it's OK to keep them.
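If discarding turns out to be the right call, one possible way to do it is to filter on the `NH:i:` tag of each alignment record, which gives the number of reported alignments for the read. The sketch below is a minimal illustration in plain Python working on SAM text lines. The function names `is_unique` and `keep_unique` are hypothetical, and it assumes the aligner wrote an `NH` tag on every record; records without the tag are dropped here, which is a conservative choice and not the only reasonable one.

```python
def is_unique(sam_line):
    """Return True if the record's NH tag reports exactly one alignment."""
    fields = sam_line.rstrip("\n").split("\t")
    # Optional tags start after the 11 mandatory SAM columns.
    for tag in fields[11:]:
        if tag.startswith("NH:i:"):
            return int(tag[len("NH:i:"):]) == 1
    # No NH tag found: uniqueness cannot be confirmed, so reject the record.
    return False


def keep_unique(sam_lines):
    """Yield header lines unchanged and only uniquely aligned records."""
    for line in sam_lines:
        if line.startswith("@") or is_unique(line):
            yield line


# Tiny made-up example: one header line, one unique read, one multi-mapped read.
example = [
    "@HD\tVN:1.0\tSO:coordinate",
    "r1\t0\tchr1\t100\t50\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:1",
    "r2\t0\tchr1\t200\t3\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:2",
]
filtered = list(keep_unique(example))
```

In practice the lines would come from something like `samtools view -h accepted_hits.bam` piped into the script, rather than from a hard-coded list.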
All comments are welcome. Thanks.
Thanks for your reply.
It's reassuring to know my data are good XD
I ran FastQC before TopHat and the results were very satisfactory. The only reason I trimmed the reads was to be able to run MISO (http://genes.mit.edu/burgelab/miso/docs/) afterwards.
MISO requires all reads to be the same length. Part of my raw data contained reads of different lengths, so I had to trim everything to 50 nt so that I would not lose too many reads.
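For anyone wanting to do the same kind of fixed-length trimming, here is a rough sketch. It is not necessarily the tool used for the data above. The function names `read_fastq` and `trim_pairs` are hypothetical, and the code assumes plain 4-line FASTQ records with the two mate files listing reads in the same order. A pair is dropped entirely when either mate is shorter than the target length, so the left and right files stay in sync.

```python
def read_fastq(lines):
    """Group FASTQ lines into (header, sequence, quality) tuples.

    Assumes strict 4-line records; a trailing incomplete record is ignored.
    """
    it = (line.rstrip("\n") for line in lines)
    for header, seq, _plus, qual in zip(it, it, it, it):
        yield header, seq, qual


def trim_pairs(left, right, length=50):
    """Yield mate pairs cut to `length` bases, skipping pairs with a short mate."""
    for (h1, s1, q1), (h2, s2, q2) in zip(left, right):
        if len(s1) < length or len(s2) < length:
            continue  # drop both mates so the pairing is preserved
        yield (h1, s1[:length], q1[:length]), (h2, s2[:length], q2[:length])


# Made-up reads: pair r1 is long enough on both sides, pair r2 has a 40-base left mate.
left = read_fastq(["@r1/1", "A" * 60, "+", "I" * 60,
                   "@r2/1", "C" * 40, "+", "I" * 40])
right = read_fastq(["@r1/2", "G" * 55, "+", "I" * 55,
                    "@r2/2", "T" * 60, "+", "I" * 60])
kept = list(trim_pairs(left, right, length=50))
```

The choice of 50 is a trade-off: a longer target keeps more bases per read but discards every pair with a mate below that length.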