tophat result interpretation

2.5 years ago by

1603.neha • 70 wrote:

I downloaded an SRA sequence from EBI. After QC and grooming, I ran TopHat and got this result: Reads: Input: 180318237; Mapped: 107042965 (59.4% of input); of these, 9998307 (9.3%) have multiple alignments (20224 have >20). 59.4% overall read mapping rate.

Is this good enough to carry out further analysis with Cufflinks and Cuffdiff? Thanks.
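For reference, the percentages in that summary can be re-derived from the raw counts. The sketch below is only an illustration: the `SUMMARY` string is the text quoted in the question, and the function name `mapping_stats` and the regular expressions are assumptions, not part of TopHat.

```python
import re

# Summary text as quoted in the question (layout flattened onto one line).
SUMMARY = (
    "Reads: Input : 180318237 Mapped : 107042965 (59.4% of input) "
    "of these: 9998307 ( 9.3%) have multiple alignments (20224 have >20) "
    "59.4% overall read mapping rate."
)


def mapping_stats(text):
    """Pull the three read counts out of the summary and recompute the rates."""
    input_reads = int(re.search(r"Input\s*:\s*(\d+)", text).group(1))
    mapped = int(re.search(r"Mapped\s*:\s*(\d+)", text).group(1))
    multi = int(re.search(r"of these:\s*(\d+)", text).group(1))
    return {
        "input": input_reads,
        "mapped": mapped,
        "unmapped": input_reads - mapped,
        # Fraction of all input reads that aligned at least once.
        "mapped_pct": 100.0 * mapped / input_reads,
        # Fraction of the mapped reads that aligned to more than one place.
        "multi_pct": 100.0 * multi / mapped,
    }


stats = mapping_stats(SUMMARY)
print(f"mapped:   {stats['mapped_pct']:.1f}% of input")
print(f"multi:    {stats['multi_pct']:.1f}% of mapped")
print(f"unmapped: {stats['unmapped']} reads")
```

Note that the multiple-alignment percentage is relative to the mapped reads, not to the input.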

rna-seq tophat • 1.5k views

modified 2.5 years ago by y.hoogstrate • 460 • written 2.5 years ago by 1603.neha • 70

2.5 years ago by

Jennifer Hillman Jackson ♦ 25k

United States

Jennifer Hillman Jackson ♦ 25k wrote:

Hello,

The numbers look a bit low. This could be a known data quality issue, so run FastQC first to check. If the quality is very poor, you could ask the data authors whether it is a known problem.

Overzealous trimming can also lead to poor mapping by eroding the sequence content. Perhaps try mapping a sample without any QA. Then add some QA back in and test to see which QA increases/decreases alignment rates.

But this could also be caused by a datatype assignment issue (I see this quite often). Fastq data must have quality scores scaled appropriately and be assigned the datatype "fastqsanger". Many times when this comes up, the input type given to the Fastq Groomer tool did not match the data, or fastqsanger was directly assigned to data of another fastq type (such as fastqillumina). Here is how to check your sequences and assign the correct type/scaling as needed: https://wiki.galaxyproject.org/Support#FASTQ_Datatype_QA
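As a rough illustration of that check, the sketch below guesses the quality-score offset from the range of characters in the quality lines. It is a common heuristic rather than the procedure on the wiki page: the function names and cutoffs are assumptions, it expects plain 4-line FASTQ records, and a result of "ambiguous" means the sampled reads do not settle the question.

```python
def quality_lines(path, max_records=100000):
    """Yield the quality line of each record in an uncompressed 4-line FASTQ file."""
    with open(path) as handle:
        for i, line in enumerate(handle):
            if i // 4 >= max_records:
                break
            if i % 4 == 3:  # 4th line of every record holds the qualities
                yield line.rstrip("\n")


def guess_fastq_encoding(qualities):
    """Guess the Phred offset from the lowest and highest quality characters.

    Heuristic cutoffs (assumptions):
      - any character below ';' (ASCII 59) only occurs with offset 33,
        which is what Galaxy's "fastqsanger" datatype expects;
      - any character above 'J' (ASCII 74) points to offset 64,
        as in older Illumina/Solexa files ("fastqillumina", "fastqsolexa").
    """
    lowest, highest = None, None
    for qual in qualities:
        for ch in qual:
            code = ord(ch)
            lowest = code if lowest is None else min(lowest, code)
            highest = code if highest is None else max(highest, code)
    if lowest is None:
        raise ValueError("no quality characters found")
    if lowest < 59:
        return "phred33"
    if highest > 74:
        return "phred64"
    return "ambiguous"


print(guess_fastq_encoding(["IIIIHHGG!#5"]))  # characters below ';' are present
print(guess_fastq_encoding(["hhhhggffdd"]))   # only high ASCII codes
```

If the guess is offset 64, the data would need to go through the Fastq Groomer with the matching input type rather than be relabeled as fastqsanger.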

And finally, if the sequences are less than 50 bases long, there is a specific TopHat parameter that can be modified to remove bias and potentially help with alignment rates. The value should be at least one half of the length of the shortest sequence. (But beware of dropping it too low. Instead, accept that any very short sequences less than twice this value in length may not map, and factor that in when reviewing mapping rates.)

TopHat settings to use > Full parameter list > Minimum length of read segments
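To make that rule of thumb concrete, here is a small sketch that derives a candidate value from read lengths and counts how many reads would still be too short. The function names are hypothetical, and the default of 25 is what I understand TopHat's segment length default to be; treat both as assumptions and check the tool form before relying on them.

```python
def suggested_segment_length(shortest_read, default=25):
    """Return a candidate 'Minimum length of read segments' value.

    Keeps the assumed default when reads are at least twice as long as it,
    otherwise uses half of the shortest read length, rounded up, so the
    value stays "at least one half" of that length.
    """
    if shortest_read < 1:
        raise ValueError("read length must be positive")
    if shortest_read >= 2 * default:
        return default
    return max(1, -(-shortest_read // 2))  # ceiling division


def count_too_short(read_lengths, segment_length):
    """Count reads shorter than twice the segment length (these may not map)."""
    return sum(1 for length in read_lengths if length < 2 * segment_length)


lengths = [36, 36, 30, 50, 28]  # made-up read lengths, for illustration only
segment = suggested_segment_length(36)
print(segment)
print(count_too_short(lengths, segment))
```

Basing the value on a typical read length instead of the absolute shortest, as in the example, avoids pushing it very low, at the cost of the few reads counted by `count_too_short`.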

Best, Jen, Galaxy team

written 2.5 years ago by Jennifer Hillman Jackson ♦ 25k

2.5 years ago by

y.hoogstrate • 460

Netherlands

y.hoogstrate • 460 wrote:

Jen is right, 60% is quite low. I would expect about 85-90% of reads to map correctly (hg19). I've also seen that aligning to hg18 gives a slightly lower percentage and aligning to hg38 a slightly higher one. As the human reference genome is very well characterized, it makes sense to me that the rate may be lower if you're using a reference genome of another species, although I think 60% is still pretty low. Please double check the supplementary data of the corresponding article. It usually contains a section about the protocol and may explain certain contaminants or other important characteristics of the data (like adapters/linkers or tags).

I would nevertheless try to figure out what's causing this. As Jen says, it may have to do with sequence quality (you can figure this out pretty easily by using FastQC in Galaxy), but adapter sequences may also contaminate your data. The majority of those sequences are known to FastQC, but I've also seen a few that are not, so don't blindly rely on this. I've also seen datasets that use 'tags', kind of random prefixes/suffixes on the sequences that are just there to tag a sequence to a certain dataset or for some other experimental evidence. I think RSeQC or Picard can make soft-clip distribution plots that may indicate such things (I don't know the name of the tool by heart, though), but this is not a perfect solution, and such sequences should be removed before aligning.

What I would really recommend is checking the overrepresented sequences section of the FastQC report. Just BLAT the top ones and see if there is anything that doesn't fit the species you're sequencing. It may be adapters, or maybe even RNA actually detected from some source of contamination. You can do something similar with high-quality reads that were kept in the unmapped reads of TopHat. Also take a careful look at the sequences themselves, because sequences like TTTTTTTTTTTTTTTTTTTTT or AAAAAAAAAAAAAAA don't make much sense but do appear in RNA-Seq data from time to time.
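If you want to do a similar tally yourself on a sample of reads (for example, sequences pulled from the unmapped reads), a minimal sketch could look like the following. This is not how FastQC computes its table; the function names and the 90% cutoff are assumptions chosen for illustration.

```python
from collections import Counter


def top_sequences(sequences, n=5):
    """Return the n most frequent exact sequences with their counts."""
    return Counter(sequences).most_common(n)


def is_homopolymer_rich(sequence, min_fraction=0.9):
    """Flag reads dominated by one base, such as poly-A or poly-T runs.

    min_fraction is an arbitrary cutoff picked for this example.
    """
    if not sequence:
        return False
    _, count = Counter(sequence.upper()).most_common(1)[0]
    return count / len(sequence) >= min_fraction


# Made-up reads, for illustration only.
reads = [
    "TTTTTTTTTTTTTTTTTTTTT",
    "ACGTTGCAAGCTAGCTAGGA",
    "TTTTTTTTTTTTTTTTTTTTT",
    "AAAAAAAAAAAAAAA",
]

for seq, count in top_sequences(reads, n=2):
    flag = "low complexity" if is_homopolymer_rich(seq) else "candidate for BLAT"
    print(count, seq, flag)
```

Sequences that come out on top and are not low complexity are the ones worth pasting into BLAT.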

Good luck,

Youri

modified 2.5 years ago • written 2.5 years ago by y.hoogstrate • 460
