Question: Patch For Better Fastq Description Handling
0
gravatar for Florent Angly
7.2 years ago by
Florent Angly370
Florent Angly370 wrote:
Hi, I have found some issue with the way FASTQ read description is handled by Galaxy utilities: https://bitbucket.org/galaxy/galaxy-central/issue/665/paired-end-code- mishandles-description-of Please consider pulling my patch, thanks, Florent
galaxy • 894 views
ADD COMMENTlink modified 7.1 years ago • written 7.2 years ago by Florent Angly370
0
gravatar for Florent Angly
7.1 years ago by
Florent Angly370
Florent Angly370 wrote:
I have had the chance to try the patch on several datasets and it looks good :) I reiterate my suggestion to pull the patch in galaxy-central. Best, Florent
ADD COMMENTlink written 7.1 years ago by Florent Angly370
It looks sensible, although I would add a comment to the join method to say the description from read1 is taken (if the reads differ in their descriptions). Mind you, the whole module seems to lack docstrings ;) Are there any unit tests (not that Galaxy seems to insist on them)? Peter
ADD REPLYlink written 7.1 years ago by Peter Cock1.4k
Hi Florent, Sorry for the delay. I did try the patch out shortly after you contributed it, but it caused the functional to fail. I was able to fix the issue and allow the existing tests to start passing, but I've been bogged down lately and haven't been able to perform a more thorough review of the code. If you could provide tests with files (e.g. for the tools affected) that test the new functionality, that would be a great help. The use of partition removes python compatibility for <2.5, although this is a lesser/non-concern. Also, I'm not entirely sold on having the "Identifier line" being parsed as "identifier" + <space> + "description" instead a single identifier line. This would mean that identifiers could not themselves contain spaces, but "There is no standardization for identifiers" (so they could technically have spaces?). Could two different reads be identified as "Read A" and "Read B", but then would no longer be uniquely identifiable as each would then be identified as "Read". If this added functionalilty were introduced as optional behavior (e.g. a user needs to click a checkbox on the tools to apply the id line splitting), these concerns can be mitigated. Peter, Florent, anyone else: I'd be very interested to hear your thoughts on the above, particularly in respect to know real-world data. For now, lets discount SRA data from this discussion. Thanks, Dan
ADD REPLYlink written 7.1 years ago by Daniel Blankenberg ♦♦ 1.7k
I guess you could use split, but special case on there being no space. That is the normal convention, just like with FASTA. http://dx.doi.org/10.1093/nar/gkp1137 That is expected, "@Read A" and "@Read B" have the same identifier, "Read". See also the new Illumina 1.8 naming convention where they dropped the /1 and /2 and hit it in the description. It should be tested, but I think Florent's patch will work here (while the current Galaxy behaviour won't). Peter
ADD REPLYlink written 7.1 years ago by Peter Cock1.4k
Peter and Daniel, thanks for the comments. I'll have a look at that. The Bioperl and Biopython projects use this convention for FASTA and FASTQ files. I was not aware of this new naming. It seems like a terrible decision from Illumina because now both reads in a pair technically have the same ID (but a different description). Florent
ADD REPLYlink written 7.1 years ago by Florent Angly370
This is not quite the case. Here are two fastq header lines for a pair of reads produced by Illumina's CASAVA 1.8: @XYZZY:123:D0ABCDEFG:7:1101:1445:2057 1:N:0:CTTGTA @XYZZY:123:D0ABCDEFG:7:1101:1445:2057 2:N:0:CTTGTA The two key things to note, relevant to this discussion are: 1. A space character is used to split the fields into two groups. This is actually a good thing, because that particular character can NEVER appear in either a sequence or a quality line. This make it easy to detect name lines as those beginning with "@" (a valid quality character) and also having a space. If you are writing a parser for the new Illumina fastq format, please don't break the names on spaces! 2. Appart from the read number, encoded as the digit immediately following the space, the two lines are identical--as they were with earlier CASAVA versions. Why is this worse than two lines differing by "/1" vs. "/2"? An additional improvement with the new naming convention is that flowcell and run ID's, as well as a flag for not passing filters (where N means does PF), are now included. Eric L. Cabot Biotechnology Center University of Wisconsin-Madison
ADD REPLYlink written 7.1 years ago by Eric Cabot20
Yes, Illumina gives both read 1 and read 2 the same template ID of XYZZY:123:D0ABCDEFG:7:1101:1445:2057 (much like the two reads would have the same ID in a SAM/BAM file). Yes, you could use the space as a sanity test for *this* style Illumina FASTQ, and have a bespoke parser which treats this all specially. But for a generic FASTQ parser you *should* split at the space. The point is Illumina have changed the meaning of their FASTQ identifier, it used to be the template ID plus a /1 or /2 suffix, but now it is just the common template ID used for both parts. Because it is a change from the existing well established convention, which will require changed to hundreds of scripts and and tools (guessed number including user's bespoke scripts). Yes, that is good. Peter
ADD REPLYlink written 7.1 years ago by Peter Cock1.4k
Hi Dan, I finally addressed your comments. See the updated pull request: https://bitbucket.org/galaxy/galaxy-central/pull-request/8/paired-end- code-mishandles-description-of Best, Florent
ADD REPLYlink written 7.0 years ago by Florent Angly370
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 178 users visited in the last hour