Question: Fastq Collapse?
7.0 years ago by Johnson, Kory (NIH/NINDS) [C]
Hello Galaxy users,

This is a follow-up to the user group question I just sent to the list-serv. I asked about FASTQ collapse because the FASTX-Toolkit by Assaf Gordon describes its collapse tool as follows:

"FASTQ/A Collapser, Collapsing identical sequences in a FASTQ/A file into a single sequence (while maintaining reads counts)"

Yet the collapse tool in Galaxy appears to support FASTA only.

Why am I asking? I would like to remove duplicate reads in a FASTQ file by sequence, leaving one representative unique read that carries the best quality line among the duplicates it was identified from. I can certainly convert FASTQ to FASTA and then collapse, but without the qual file you cannot reconstitute a FASTQ file with the actual quality scores.

Any argument for or against? Or can Galaxy already do this and I am missing the tool to use?

Thanks ... best,
Kory
galaxy • 2.1k views
modified 7.0 years ago by Ben Bimber • written 7.0 years ago by Johnson, Kory (NIH/NINDS) [C]
7.0 years ago by Ben Bimber
Ben Bimber wrote:
I don't have any intimate knowledge here, but my guess is that it comes down to defining which quality scores to keep. Collapsing sequences is easy: the sequences either match or they don't. Handling sequence + quality is harder. Would you keep the quality string with the highest total sum of qualities? What if two have an identical sum but different strings (which is probably not uncommon)? In theory you could create a completely new quality string that summarizes the quality scores at each position, for example by averaging them. All of these are possible, but it starts becoming less transparent and more complex. Collapsing FASTQ to FASTA is simply a way to remove that problem.

A similar scenario I could imagine sometimes being useful would be 'collapse sequences, but don't count Ns as mismatches'. Certainly possible, but more complicated than simply collapsing reads that are 100% sequence-identical.

Once again, just my own thoughts here. Assaf or someone from Galaxy could perhaps answer better.

-ben
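For illustration only, here is a minimal Python sketch of the first option discussed above: collapse reads that are 100% sequence-identical and keep the quality string with the highest summed score. It is not how the FASTX-Toolkit or any Galaxy tool works. The function names, the ` count=N` header suffix, the Phred+33 offset, and the tie-break rule (on equal sums, keep the first read seen) are all assumptions made for this example, and it expects plain 4-line FASTQ records.

```python
import io


def read_fastq(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def qual_sum(qual, offset=33):
    """Sum of Phred scores, assuming the given ASCII offset (33 is an assumption)."""
    return sum(ord(c) - offset for c in qual)


def collapse_fastq(records, offset=33):
    """Keep one read per distinct sequence: the one with the highest summed quality.

    Ties are resolved by keeping the first read seen (an arbitrary, but
    deterministic, choice). The duplicate count is appended to the header.
    """
    best = {}  # sequence -> [header, quality, count]; dicts keep insertion order
    for header, seq, qual in records:
        entry = best.get(seq)
        if entry is None:
            best[seq] = [header, qual, 1]
            continue
        entry[2] += 1
        if qual_sum(qual, offset) > qual_sum(entry[1], offset):
            entry[0], entry[1] = header, qual
    for seq, (header, qual, count) in best.items():
        yield f"{header} count={count}", seq, qual


# Small in-memory demonstration with made-up reads.
demo = "@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nJJJJ\n@r3\nTTTT\n+\n!!!!\n"
for header, seq, qual in collapse_fastq(read_fastq(io.StringIO(demo))):
    print(header, seq, qual, sep="\n")
```

Keeping an existing quality string means the output always contains a quality line that was actually observed, which is the more transparent of the options; the cost is that the tie-break is arbitrary.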
written 7.0 years ago by Ben Bimber
Hi Ben,

Thanks for your reply! Yes, perhaps the way to go is to take the average, median, or max quality value by position across duplicate reads, to use a sliding-window function, or simply to randomly select one quality line from those having the highest summed quality value.

I would think the complexity of the problem can be greatly reduced if "FASTQ collapse" is done after FASTQ filtering by quality and/or length. It would also be interesting to see what the numbers look like after filtering: how often duplicate sequences have the same quality string versus differing by position or by base composition.

Perhaps a FASTQ collapse tool could be developed not only to remove duplicates and replace them with a representative quality line, but also to serve as a FASTQ filtering polish step based on discordance rates by position across duplicates. A user could then look at the discordance rates via box plot, as you can for quality scores observed across all reads, and pick refined criteria for filtering. Just an idea.

Thanks again for taking the time to respond, Ben.

Best,
Kory
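As a rough sketch of the per-position idea raised in this exchange, the Python below builds a new quality string from a group of duplicate reads by taking the max, mean, or median score at each position, and flags positions where the duplicates disagree. The function names and the Phred+33 offset are assumptions for illustration; no existing tool is being described. Note that an arithmetic mean of Phred scores is not the same as averaging the underlying error probabilities, so the "mean" option is only a simple summary.

```python
from statistics import mean, median


def consensus_quality(quals, how="max", offset=33):
    """Combine equal-length quality strings position by position.

    `how` selects the per-position summary: "max", "mean", or "median".
    Non-integer results are rounded with Python's round() (half to even).
    """
    reducers = {"max": max, "mean": mean, "median": median}
    reduce_fn = reducers[how]
    if len({len(q) for q in quals}) != 1:
        raise ValueError("quality strings must be non-empty and of equal length")
    out = []
    for column in zip(*quals):
        scores = [ord(c) - offset for c in column]
        out.append(chr(int(round(reduce_fn(scores))) + offset))
    return "".join(out)


def discordant_positions(quals):
    """Return one boolean per position: True where duplicates disagree on quality."""
    return [len(set(column)) > 1 for column in zip(*quals)]


# Two made-up quality strings from reads sharing the same sequence.
duplicates = ["II5I", "I5II"]
print(consensus_quality(duplicates, how="max"))
print(consensus_quality(duplicates, how="mean"))
print(discordant_positions(duplicates))
```

Summing the flags from `discordant_positions` across all duplicate groups would give a per-position discordance rate of the kind that could be plotted alongside the usual per-position quality box plot.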
written 7.0 years ago by Johnson, Kory (NIH/NINDS) [C]
I'm not terribly familiar with what Galaxy offers along these lines, but google 'fastqc' or Picard tools for a simple way to find the sort of quality distributions you just mentioned. It would take a little work on your end, but you could answer that question. The former actually uses Picard tools behind the scenes, but is a little more graphically oriented.

-ben
written 7.0 years ago by Ben Bimber