Tophat2 align_summary interpretation

4.4 years ago by

Canada

Delong, Zhou • 140 wrote:

Hello,

I'm examining my tophat output data and I would like to have some opinions on my align_summary.txt.

For example, here is the summary for one of my RNA-seq alignments. The quality scores are good, and I trimmed the reads to a fixed length of 50 nt.

Left reads:
Input: 183588390
Mapped: 174579844 (95.1% of input)
of these: 36018754 (20.6%) have multiple alignments (48483 have >20)
Right reads:
Input: 183588390
Mapped: 174631890 (95.1% of input)
of these: 40873773 (23.4%) have multiple alignments (75889 have >20)
95.1% overall read alignment rate.

Aligned pairs: 168870512
of these: 24343928 (14.4%) have multiple alignments
and: 14517921 ( 8.6%) are discordant alignments
84.1% concordant pair alignment rate.
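The percentages in this summary can be recomputed from the raw counts, which makes clear what each one is relative to. The following is a minimal sketch, not part of TopHat: `parse_summary` and `pct` are hypothetical helper names, and the concordant-rate formula (aligned pairs minus discordant pairs, divided by input pairs) is inferred from the numbers above, where it reproduces the reported 84.1%.

```python
import re

SUMMARY = """\
Left reads:
Input: 183588390
Mapped: 174579844 (95.1% of input)
of these: 36018754 (20.6%) have multiple alignments (48483 have >20)
Right reads:
Input: 183588390
Mapped: 174631890 (95.1% of input)
of these: 40873773 (23.4%) have multiple alignments (75889 have >20)
95.1% overall read alignment rate.

Aligned pairs: 168870512
of these: 24343928 (14.4%) have multiple alignments
and: 14517921 ( 8.6%) are discordant alignments
84.1% concordant pair alignment rate.
"""


def first_count(line):
    """Return the first integer that follows a colon on the line."""
    return int(re.search(r":\s*(\d+)", line).group(1))


def parse_summary(text):
    """Extract raw counts from a paired-end align_summary.txt (hypothetical helper)."""
    counts, section = {}, None
    for line in text.splitlines():
        line = line.strip()
        if line.endswith("reads:"):
            section = line.split()[0].lower()  # "left" or "right"
        elif line.startswith("Aligned pairs:"):
            section = "pairs"
            counts["pairs"] = first_count(line)
        elif line.startswith(("Input:", "Mapped:")):
            counts[f"{section}_{line.split(':')[0].lower()}"] = first_count(line)
        elif line.startswith("of these:"):
            counts[f"{section}_multi"] = first_count(line)
        elif line.startswith("and:"):
            counts["pairs_discordant"] = first_count(line)
    return counts


def pct(part, whole):
    """Percentage rounded to one decimal, as in the summary file."""
    return round(100.0 * part / whole, 1)


c = parse_summary(SUMMARY)
rates = {
    # Multi-mapping rates are relative to MAPPED reads, not to input reads.
    "left_multi": pct(c["left_multi"], c["left_mapped"]),
    "right_multi": pct(c["right_multi"], c["right_mapped"]),
    "overall": pct(c["left_mapped"] + c["right_mapped"],
                   c["left_input"] + c["right_input"]),
    "pairs_multi": pct(c["pairs_multi"], c["pairs"]),
    "pairs_discordant": pct(c["pairs_discordant"], c["pairs"]),
    # Assumed formula: concordant pairs over INPUT pairs.
    "concordant": pct(c["pairs"] - c["pairs_discordant"], c["left_input"]),
}
for name, value in rates.items():
    print(f"{name}: {value}%")
```

Note that the multiple-alignment percentages use mapped reads as the denominator, while the concordant pair rate uses the number of input pairs.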

I'm happy with a 95% alignment of input, but I'm not sure where the multiple alignments come from nor whether the rate is acceptable.

The raw data come from RNA-seq, so the reads should not align to repeated sequences. I suppose the multiple alignments come from mismatches.

I can always discard these reads, but I'd like to know if it's OK to keep them.

All comments are welcome. Thanks.

tophat align_summary multiple alignment • 9.6k views

modified 4.4 years ago by wukai199010 • 0 • written 4.4 years ago by Delong, Zhou • 140

Thanks for your reply.

It's reassuring to know my data are good XD

I ran FastQC before TopHat and the results were very satisfactory. The only reason I trimmed the reads was to be able to run MISO (http://genes.mit.edu/burgelab/miso/docs/) afterwards.

MISO requires all reads to be the same length. Part of my raw data contained reads of different sizes, so I had to trim everything to 50 nt so that I would not lose too many reads.
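Trimming paired-end reads to one fixed length can be sketched as below. This is only an illustration under stated assumptions: the input is plain four-line FASTQ, the two files list mates in the same order, and `read_fastq` and `trim_pairs` are hypothetical helper names rather than functions from MISO or TopHat. A pair is dropped as a whole when either mate is shorter than the target, so the two output files stay in sync.

```python
import io


def read_fastq(handle):
    """Yield (header, sequence, quality) from a four-line FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def trim_pairs(left_in, right_in, left_out, right_out, length=50):
    """Trim both mates to `length` bases; drop the pair if either is shorter."""
    kept = dropped = 0
    for left, right in zip(read_fastq(left_in), read_fastq(right_in)):
        if len(left[1]) < length or len(right[1]) < length:
            dropped += 1
            continue
        for (header, seq, qual), out in ((left, left_out), (right, right_out)):
            out.write(f"{header}\n{seq[:length]}\n+\n{qual[:length]}\n")
        kept += 1
    return kept, dropped


# Tiny in-memory demo with length=5 instead of 50 to keep the records short.
left_in = io.StringIO("@r1/1\nACGTACGT\n+\nIIIIIIII\n@r2/1\nACG\n+\nIII\n")
right_in = io.StringIO("@r1/2\nTTTTTTTT\n+\nIIIIIIII\n@r2/2\nGGGGGGGG\n+\nIIIIIIII\n")
left_out, right_out = io.StringIO(), io.StringIO()
kept, dropped = trim_pairs(left_in, right_in, left_out, right_out, length=5)
print(kept, dropped)
```

With real data the handles would be opened files, and `length=50` matches the trimming described here.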

written 4.4 years ago by Delong, Zhou • 140

4.4 years ago by

Jennifer Hillman Jackson ♦ 25k

United States

Jennifer Hillman Jackson ♦ 25k wrote:

Hello,

This report actually looks pretty good. Concordant pair rates are sometimes higher, but that depends on how successful the sequencing was and how much you lost when you trimmed. You could try without trimming and compare: Tophat2 is pretty good at aligning the best part of a sequence and retaining the alignment, even if the ends are of slightly lower quality. The only trimming that is often necessary is adaptor removal (unless the adaptor makes up most or all of the sequence, in which case it isn't worth it, since the read will be discarded during mapping anyway). Running FastQC on your datasets, if you haven't yet, should give you an idea about the content and what to expect to lose or what modifications to consider.

As for the multiple matches, that is normal. Remember that there are pseudogenes and other valid genome duplications (the transcriptome does contain repeats and segmental duplications). Anything not in a concordant pair will not be used in the downstream analysis, so it is safe to leave these in; there is no need to filter. Maximizing concordant pairs is your goal.
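To see which reads are multi-mapped or concordantly paired in the alignments themselves, the SAM records can be inspected directly. The sketch below is a rough illustration, not TopHat code: it assumes the aligner writes the `NH:i:` tag (number of reported alignments for the read), marks secondary alignments with flag 0x100, and sets the "properly paired" flag 0x2 for concordant pairs. `summarize_sam` is a hypothetical helper and the demo records are made up.

```python
def summarize_sam(lines):
    """Count primary alignments, multi-mapped reads, and properly paired reads."""
    stats = {"primary": 0, "multi": 0, "proper_pair": 0}
    for line in lines:
        if line.startswith("@"):  # header line
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & 0x4 or flag & 0x100:  # unmapped or secondary alignment
            continue
        stats["primary"] += 1
        if flag & 0x2:  # mate pair aligned concordantly ("properly paired")
            stats["proper_pair"] += 1
        # Optional tags start at column 12; a missing NH tag is treated as 1.
        nh = next((int(f[5:]) for f in fields[11:] if f.startswith("NH:i:")), 1)
        if nh > 1:
            stats["multi"] += 1
    return stats


def sam_record(name, flag, nh=None):
    """Build a synthetic SAM line for the demo (hypothetical data)."""
    fields = [name, str(flag), "chr1", "100", "50", "5M", "=", "200", "105",
              "ACGTA", "IIIII"]
    if nh is not None:
        fields.append(f"NH:i:{nh}")
    return "\t".join(fields) + "\n"


demo = [
    "@HD\tVN:1.0\n",
    sam_record("r1", 99, nh=1),   # proper pair, unique alignment
    sam_record("r2", 97, nh=3),   # paired but not proper, 3 alignments
    sam_record("r2", 353, nh=3),  # secondary alignment of r2, skipped
    sam_record("r3", 77),         # unmapped read, skipped
]
stats = summarize_sam(demo)
print(stats)
```

For real output, the lines could come from converting the BAM file to SAM text first (for example with `samtools view -h`) and iterating over the result.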

In general, the less done to the data the better, after basic QA for artifact, quality score scaling, and such. But you can test this as you go along and find the best path for your data.

Hopefully this helps! Jen, Galaxy team

modified 4.4 years ago • written 4.4 years ago by Jennifer Hillman Jackson ♦ 25k

4.4 years ago by

wukai199010 • 0

United States

wukai199010 • 0 wrote:

When I run my RNA-Seq data, I get a very high multiple-alignment rate, for example "of these: 28066482 (93.3%) have multiple alignments (15218075 have >20)". What does that mean?

It also reports a "59.6% concordant pair alignment rate".

written 4.4 years ago by wukai199010 • 0

93.3% multiple alignments? That seems incredibly high to me.

What is your total number of reads? What is your quality score distribution? What is your alignment percentage? What is your read length distribution?

One explanation I came up with is that your reads are too short to align uniquely. Another possibility is that most of your reads are not aligned at all; since the multiple-alignment percentage is computed relative to mapped reads only, it can look very high in that case.
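The read length distribution asked about above can be checked quickly before alignment. This is a minimal sketch assuming plain four-line FASTQ input; `read_length_distribution` is a hypothetical helper name and the demo records are made up.

```python
import io
from collections import Counter


def read_length_distribution(handle):
    """Count reads per sequence length in a four-line FASTQ stream."""
    lengths = Counter()
    for i, line in enumerate(handle):
        if i % 4 == 1:  # the second line of each record is the sequence
            lengths[len(line.strip())] += 1
    return lengths


# In-memory demo; with real data, pass an opened FASTQ file instead.
demo_fastq = io.StringIO(
    "@r1\nACGTACGT\n+\nIIIIIIII\n"
    "@r2\nACG\n+\nIII\n"
    "@r3\nTTTTGGGG\n+\nIIIIIIII\n"
)
lengths = read_length_distribution(demo_fastq)
for length, count in sorted(lengths.items()):
    print(length, count)
```

A distribution dominated by very short reads would be consistent with the first explanation, since short reads are more likely to match several places in the genome.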

modified 4.4 years ago • written 4.4 years ago by Delong, Zhou • 140
