Question: Number Of Mismatches Allowed In The Initial Read Mapping
Du, Jianguang wrote (6.2 years ago):
Dear All,

I tested how to set the "Number of mismatches allowed in the initial read mapping" as follows.

First, I ran FASTQ Groomer on a dataset to get the total number of reads, which is 17510227.

Then I ran TopHat with "Number of mismatches allowed in the initial read mapping" set to 1, and ran "flagstat" under "NGS: SAM Tools" on the result. Here are the statistics for the TopHat output:

    18162942 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 duplicates
    18162942 + 0 mapped (100.00%:-nan%)
    0 + 0 paired in sequencing
    0 + 0 read1
    0 + 0 read2
    0 + 0 properly paired (-nan%:-nan%)
    0 + 0 with itself and mate mapped
    0 + 0 singletons (-nan%:-nan%)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)

Next, I ran TopHat with "Number of mismatches allowed in the initial read mapping" set to 0, and again ran "flagstat" under "NGS: SAM Tools". Here are the statistics for the TopHat output:

    16100027 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 duplicates
    16100027 + 0 mapped (100.00%:-nan%)
    0 + 0 paired in sequencing
    0 + 0 read1
    0 + 0 read2
    0 + 0 properly paired (-nan%:-nan%)
    0 + 0 with itself and mate mapped
    0 + 0 singletons (-nan%:-nan%)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)

Does this mean that about 0.6 million reads were aligned two or more times when I set "Number of mismatches allowed in the initial read mapping" to 1, whereas about 1.4 million reads could not be aligned under the more stringent setting of 0? Which value should we choose?

Thanks.

Jianguang
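The "about 0.6 million" and "about 1.4 million" figures in the question come from comparing the flagstat totals with the FASTQ Groomer read count. A minimal sketch of that arithmetic follows; the variable names are illustrative and the numbers are copied from the post. Note that flagstat counts alignment records rather than distinct reads, so these differences are only bounds, not exact counts of multi-mapped or unmapped reads.

```python
# Counts copied from the post above; variable names are illustrative.
total_reads = 17_510_227   # reads reported by FASTQ Groomer
records_mm1 = 18_162_942   # flagstat "in total" with 1 mismatch allowed
records_mm0 = 16_100_027   # flagstat "in total" with 0 mismatches allowed

# With 1 mismatch: alignment records beyond one-per-read. At least this many
# records must be additional alignments of reads that were already counted.
excess_mm1 = records_mm1 - total_reads

# With 0 mismatches: a lower bound on reads with no reported alignment, since
# some of the records may also be repeat alignments of the same read.
shortfall_mm0 = total_reads - records_mm0

print(excess_mm1)     # 652715
print(shortfall_mm0)  # 1410200
```

Because a single read can contribute several records, the true number of unaligned reads in either run can only be found by counting distinct read names in the output.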
rna-seq tophat • 1.7k views
Jennifer Hillman Jackson (United States) wrote (6.2 years ago):
Hello Jianguang,

This is in reply to this email and your prior email from yesterday (9/6), subject "Tophat settings". The testing here was a very good way to see how parameters impact mapping. In addition, see below ...

On whether about 0.6 million reads were aligned two or more times with the setting at 1: maybe, but I don't think it is that simple, nor is it something that is important for the final result. What this really means in the end is that more reads were permitted to be mapped because the criteria were less stringent. A mismatch of 1 was allowed in the initial step, so more reads were available to meet the other mapping criteria. More of these passed the downstream mapping criteria than in the other dataset and were eventually included in the output. This is why the number of mapped reads is higher.

On whether about 1.4 million reads could not be aligned with the setting at 0: yes, if you use more stringent criteria (with any mapping tool, not just TopHat), less of your data will map. A mismatch of 0 is an exact match, which is maximum stringency, so fewer reads met the initial mapping criteria, which removed them from downstream evaluation by the other mapping criteria. Then, fewer of them passed the downstream mapping criteria than in the other dataset and fewer were included in the output. This is why the number of mapped reads is lower.

If the other mapping criteria for both runs were the same, and this was the only variable changed, then a reasonable way to explain these results would be to state something like: the initial mapping with mismatch 0 filtered out sequences that would otherwise have mapped if mismatch 1 had been used instead.

As for which value to choose, this is something that you will need to decide. There are likely many ways to analyze this further, but sometimes just looking at some of the data in a browser can provide a lot of information that statistics cannot. Pick a few favorite (complex and simple) gene bounds with spliced transcripts, add in your mapping results, put the data into a browser (Trackster, UCSC, etc.) and see which results make the most sense for your particular experiment, dataset, and genome. (There are no hard rules around this.)

I don't mean to push you towards another list again, but I want you to get the answers you need. If you really do have serious concerns about how the TopHat mapping algorithm itself is functioning, or suspect a problem, the tool authors and the mailing list dedicated to this exact topic are really the best resources for discussing the finer details: tophat.cufflinks@gmail.com

Best wishes for your project,

Jen
Galaxy team

--
Jennifer Jackson
http://galaxyproject.org
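One way to look further at how the two runs differ is to count distinct read names in each output rather than alignment records, since flagstat reports records. The sketch below is only an illustration under stated assumptions: the function name `summarize_sam` is hypothetical, the input is SAM-format text lines (a BAM output would first need converting to SAM), and the reads are single-end, as in the datasets above. The toy records are invented purely to exercise the function.

```python
from collections import Counter


def summarize_sam(lines):
    """Return (alignment_records, distinct_mapped_reads, multi_mapped_reads).

    Hypothetical helper: assumes single-end reads in SAM text format, where
    field 1 is the read name and field 2 is the bitwise FLAG.
    """
    hits = Counter()
    for line in lines:
        # Skip blank lines and header lines, which start with "@".
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        # FLAG bit 0x4 marks a record for an unmapped read.
        if flag & 0x4:
            continue
        hits[name] += 1
    records = sum(hits.values())
    multi = sum(1 for count in hits.values() if count > 1)
    return records, len(hits), multi


# Toy records, invented for illustration only.
toy = [
    "@HD\tVN:1.0\tSO:coordinate",
    "r1\t0\tchr1\t100\t255\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t0\tchr1\t200\t3\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t256\tchr2\t300\t3\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]

print(summarize_sam(toy))  # (3, 2, 1)
```

On real data the same function could be given an open file handle for each run's SAM output; comparing the distinct-read counts between the mismatch 0 and mismatch 1 runs shows how many reads each setting actually placed, separately from how many times each read was reported.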