Question: quality score length differs from sequence length
3.3 years ago by
eloise.greenland0 wrote:

I have a fastq file that tophat2 can't process because one of the reads seems to have a sequence length far shorter than the corresponding quality scores (154 for sequence, 434 for quality scores). The entry with the mismatch doesn't seem to have a name so I can't find it and get rid of it or mask it from the analysis that way. Is there any way that I can instruct tophat2 to ignore any reads where this kind of mismatch occurs? Or some other solution?


3.3 years ago by
United States
Jennifer Hillman Jackson (25k) wrote:


First check the end of your file with the tool "Text manipulation: Select last lines from a dataset". Usually this problem is due to a truncated dataset upload and the problem will be in the last few lines. You will almost certainly want to reload the data. But if you do want to remove this line and keep the rest for some reason, there are other line manipulation functions in this same tool group.

This could be more complicated if you uploaded files, merged them, etc. And the truncated data could come from a transfer that is upstream of Galaxy.

Tools like "Fastq Groomer" will report problems by line numbers plus print out the contents of those lines. So that is also an option for finding where problems are within a dataset.
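If you are working locally on the command line, the same kind of check can be sketched in a few lines of Python. This is only a minimal sketch, not what Fastq Groomer does internally: the function name `find_length_mismatches` is made up for illustration, and it assumes the file uses strict four-line records (header, sequence, "+", quality) with no wrapped sequence or quality lines. If a line is missing somewhere upstream, every record after that point will look wrong, so the first reported line number is the one to inspect.

```python
import io
from typing import Iterator, TextIO, Tuple


def find_length_mismatches(handle: TextIO) -> Iterator[Tuple[int, str, int, int]]:
    """Yield (header_line_number, header, seq_len, qual_len) for each record
    whose sequence and quality strings differ in length.

    Assumes strict 4-line FASTQ records (no line wrapping).
    """
    line_no = 0
    while True:
        header = handle.readline()
        if not header:  # end of file
            break
        seq = handle.readline().rstrip("\r\n")
        handle.readline()  # the "+" separator line, not checked here
        qual = handle.readline().rstrip("\r\n")
        start = line_no + 1  # 1-based line number of this record's header
        line_no += 4
        if len(seq) != len(qual):
            yield (start, header.rstrip("\r\n"), len(seq), len(qual))


# Small in-memory demonstration; the second record is deliberately broken.
sample = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIII\n")
for start, header, seq_len, qual_len in find_length_mismatches(sample):
    print(f"line {start}: {header!r} sequence={seq_len} quality={qual_len}")
```

For a real file, pass an open text handle instead of the in-memory sample, for example `open("reads.fastq")` (a placeholder filename), or `gzip.open(path, "rt")` for a compressed file.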

More troubleshooting help:

Thanks, Jen, Galaxy team

Hi Jen,

Thanks for your suggestions. I checked the last lines of the file and it doesn't appear truncated. I get the same problem using tophat2 locally on the command line. I haven't merged the data or anything; it is the raw sequencing file I received from our sequencing provider.

I did try the Fastq Groomer in Galaxy and it failed to execute on this set of reads. I deleted the failed job from my history when things were getting a bit messy. With any luck I haven't purged it yet and I can go back and check the info from the failed run. Otherwise I will run it again and get back to you.



