Question: Tophat2 align_summary interpretation
Delong, Zhou (Canada) wrote 3.5 years ago:

Hello,

I'm examining my tophat output data and I would like to have some opinions on my align_summary.txt.

For example, here is the summary for one RNA-seq alignment. The quality scores are good, and I trimmed the reads to a fixed length of 50 nt.

 

Left reads:
               Input: 183588390
              Mapped: 174579844 (95.1% of input)
            of these:  36018754 (20.6%) have multiple alignments (48483 have >20)
Right reads:
               Input: 183588390
              Mapped: 174631890 (95.1% of input)
            of these:  40873773 (23.4%) have multiple alignments (75889 have >20)
95.1% overall read alignment rate.

Aligned pairs: 168870512
     of these:  24343928 (14.4%) have multiple alignments
          and:  14517921 ( 8.6%) are discordant alignments
84.1% concordant pair alignment rate.
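The percentages in this summary can be re-derived from the raw counts, which helps when reading the report. The sketch below is only a check of the arithmetic. The variable names are made up for illustration, and the concordant-rate formula (aligned pairs minus discordant pairs, divided by input pairs) is an assumption that happens to reproduce the reported 84.1%, not a statement of how TopHat2 computes it internally.

```python
# Counts copied from the align_summary.txt above.
left_input, left_mapped, left_multi = 183588390, 174579844, 36018754
right_input, right_mapped, right_multi = 183588390, 174631890, 40873773
aligned_pairs, pairs_multi, pairs_discordant = 168870512, 24343928, 14517921


def pct(part, whole):
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(100.0 * part / whole, 1)


# Overall read alignment rate: mapped mates over all input mates.
overall_rate = pct(left_mapped + right_mapped, left_input + right_input)

# Multi-mapping rates are relative to the mapped reads, not to the input.
left_multi_rate = pct(left_multi, left_mapped)
right_multi_rate = pct(right_multi, right_mapped)

# Pair-level rates are relative to the aligned pairs.
pairs_multi_rate = pct(pairs_multi, aligned_pairs)
pairs_discordant_rate = pct(pairs_discordant, aligned_pairs)

# Assumed formula: concordant pairs = aligned pairs - discordant pairs,
# expressed as a share of the input pairs.
concordant_rate = pct(aligned_pairs - pairs_discordant, left_input)

print("overall read alignment rate:", overall_rate)
print("left / right multi-mapped:", left_multi_rate, right_multi_rate)
print("pairs multi-mapped / discordant:", pairs_multi_rate, pairs_discordant_rate)
print("concordant pair alignment rate:", concordant_rate)
```

Note that the denominators differ from line to line: the 20.6% and 23.4% figures are fractions of the mapped reads, while the 84.1% figure is a fraction of all input pairs.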

 

 

I'm happy with a 95% alignment rate, but I'm not sure where the multiple alignments come from or whether this rate is acceptable.

The raw data come from RNA-seq, so I would not expect the reads to align to repeated sequences. I suppose the multiple alignments come from mismatches.

I can always discard these reads, but I'd like to know whether it's OK to keep them.

 

All comments are welcome. Thanks.

Modified 3.4 years ago by wukai199010 • written 3.5 years ago by Delong, Zhou

Thanks for your reply.

It's reassuring to know my data are good XD

I ran FastQC before TopHat and the results were very satisfactory. The only reason I trimmed the reads was to be able to run MISO (http://genes.mit.edu/burgelab/miso/docs/) afterwards.

MISO requires all reads to be the same length. Part of my raw data contained reads of different lengths, so I had to trim everything to 50 nt so that I would not lose too many reads.

 

 

Reply written 3.5 years ago by Delong, Zhou
Jennifer Hillman Jackson (United States) wrote 3.5 years ago:

Hello,

This report actually looks pretty good. Concordant pair rates are sometimes higher, but that depends on how successful the sequencing was and how much you lost when you trimmed. You could try without trimming and compare: Tophat2 is pretty good at aligning the best part of a sequence and retaining the alignment, even if the ends are of slightly lower quality. Often the only trimming needed is adaptor removal (unless the adaptor makes up most or all of the sequence, in which case it isn't worth it, since the read will be discarded during mapping anyway). Running FastQC on your datasets, if you haven't yet, should give you an idea of the content and of what to expect to lose or what modifications to consider.

For the multiple matches, that is normal. Remember that there are pseudogenes and other valid genome duplications (the transcriptome contains repeats and segmental duplications). Anything not in a concordant pair will not be used in the downstream analysis, so it is safe to leave these in; there is no need to filter. Maximizing concordant pairs is your goal.
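If you would still like to see which records are multi-mapped before deciding, one option is to look at the `NH:i:` optional tag, which the SAM format defines as the number of reported alignments for the query. The sketch below is a minimal illustration, assuming the alignments have been converted to SAM text and that your aligner writes the `NH` tag. The function names and the three example records are hypothetical.

```python
def hit_count(sam_line):
    """Return the NH:i value of a SAM record, or None if the tag is absent."""
    fields = sam_line.rstrip("\n").split("\t")
    # The first 11 columns are mandatory; optional tags start at column 12.
    for tag in fields[11:]:
        if tag.startswith("NH:i:"):
            return int(tag[len("NH:i:"):])
    return None


def count_by_uniqueness(sam_lines):
    """Count records as unique (NH == 1), multi (NH > 1) or untagged."""
    counts = {"unique": 0, "multi": 0, "untagged": 0}
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        nh = hit_count(line)
        if nh is None:
            counts["untagged"] += 1
        elif nh > 1:
            counts["multi"] += 1
        else:
            counts["unique"] += 1
    return counts


# Hypothetical toy records, not taken from the data discussed in this thread.
example_sam = [
    "@HD\tVN:1.0\tSO:coordinate",
    "read1\t99\tchr1\t100\t50\t4M\t=\t300\t204\tACGT\tIIII\tNH:i:1",
    "read2\t99\tchr1\t500\t3\t4M\t=\t700\t204\tACGT\tIIII\tNH:i:3",
    "read3\t0\tchr2\t900\t50\t4M\t*\t0\t0\tACGT\tIIII",
]

counts = count_by_uniqueness(example_sam)
print(counts)
```

Counting first, rather than filtering straight away, keeps the original alignments intact, in line with the advice to do as little to the data as possible.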

In general, the less done to the data the better, after basic QA for artifact, quality score scaling, and such. But you can test this as you go along and find the best path for your data.

Hopefully this helps! Jen, Galaxy team

Modified 3.5 years ago • written 3.5 years ago by Jennifer Hillman Jackson
wukai199010 (United States) wrote 3.4 years ago:

When I run my RNA-Seq data, I get a high rate of multiple alignments, for example "of these:  28066482 (93.3%) have multiple alignments (15218075 have >20)". What does that mean?

The report also gives a "59.6% concordant pair alignment rate".

 

Written 3.4 years ago by wukai199010

93.3% multiple alignments? That seems incredibly high to me.

What is your total number of reads? What is your quality score distribution? What is your overall alignment rate? What is your read length distribution?

One explanation I came up with is that your reads are too short to align uniquely. Another possibility is that most of your reads did not align at all, so that multi-mapped reads make up a large share of the few that did.
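Part of this can be estimated from the single line that was quoted. As a rough sketch, assuming the 93.3% is relative to the mapped reads of one mate (as it is in the summary at the top of this thread) and was rounded to one decimal, the number of mapped reads can be bracketed as follows. The variable names are illustrative only, and the input read count still has to come from the full report.

```python
# Figures from the quoted line:
# "28066482 (93.3%) have multiple alignments (15218075 have >20)"
multi = 28066482
over_20 = 15218075
multi_pct = 93.3  # assumed to be rounded to one decimal place

# Bracket the implied number of mapped reads using the rounding interval.
mapped_low = multi / ((multi_pct + 0.05) / 100)
mapped_high = multi / ((multi_pct - 0.05) / 100)

# Share of the multi-mapped reads that have more than 20 alignments.
share_over_20 = 100 * over_20 / multi

print(f"implied mapped reads: {mapped_low:,.0f} to {mapped_high:,.0f}")
print(f"multi-mapped reads with >20 alignments: {share_over_20:.1f}%")
```

Comparing the implied mapped count with the input count from the same report would show whether the second explanation (few reads aligned at all) is plausible.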

Reply modified 3.4 years ago • written 3.4 years ago by Delong, Zhou