Question: Need help with "FASTQ splitter" tool
pfaucon (United States) wrote 4.1 years ago:

I was having difficulty with the FASTQ splitter: it was not finding my FASTQ file (an SRA-based FASTQ file from GEO) to split. After changing the datatype from FASTQ to FASTQSanger it seemed to work fine (Illumina didn't work; I didn't try others). Is this expected behavior? Does the tool only work with Sanger-formatted FASTQ?


 

Tool name: FASTQ splitter
Tool version: 1.0.0
Tool ID: toolshed.g2.bx.psu.edu/repos/devteam/fastq_paired_end_splitter/fastq_paired_end_splitter/1.0.0
ToolShed URL: https://toolshed.g2.bx.psu.edu/view/devteam/fastq_paired_end_splitter

 

modified 4.1 years ago by Jennifer Hillman Jackson • written 4.1 years ago by pfaucon
Jennifer Hillman Jackson (United States) wrote 4.1 years ago:

Hello,

Yes, the .fastqsanger datatype is required by most sequence manipulation tools, and often by downstream tools (such as those used for mapping).

I would suggest backing up a bit and double-checking that the datatype assignment matches the quality score scaling. These two datatypes in particular (.fastqillumina and .fastqsanger) are different. Finding out that the scaling was mixed up is fairly frustrating, depending on how far you have gone in the analysis, and it will certainly impact the usability of the results. Starting over is never fun.

This section of our wiki has many details on basic QA, and the video shows exactly what to do, step by step.
http://wiki.galaxyproject.org/Support#FASTQ_Datatype_QA

Take care, Jen, Galaxy team

written 4.1 years ago by Jennifer Hillman Jackson

Jen,

Based on this wiki page: http://en.wikipedia.org/wiki/FASTQ_format#Encoding

I believe I have Illumina 1.8+ sequence (the quality scores include "*" [Sanger or Illumina 1.8+] and also "J" [not Sanger]). I'm new to bioinformatics, and based on the encoding values I still have no idea what I should be using. I'm guessing that, because of the encoding similarity, I should select fastqsanger (since the other Illumina encodings seem quite far away)?
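The reasoning above can be checked with a little arithmetic. This is only a rough sketch, not how Galaxy itself decides: it decodes the observed quality characters under the two common ASCII offsets (33 for Sanger / Illumina 1.8+, 64 for older Illumina / Solexa) and keeps the offsets that do not produce impossibly low scores. The function name `plausible_offsets` and the lower bounds (0 for offset 33, -5 for offset 64 to allow Solexa scores) are assumptions made for illustration.

```python
# Assumed lowest valid score per ASCII offset (illustrative, not a Galaxy setting):
# offset 33 -> Phred scores start at 0; offset 64 -> Solexa scores can reach -5.
LOWEST_SCORE = {33: 0, 64: -5}


def decode(quality_chars, offset):
    """Map each quality character to its numeric score under the given offset."""
    return {c: ord(c) - offset for c in quality_chars}


def plausible_offsets(quality_chars):
    """Return the offsets under which no character decodes below the allowed minimum."""
    return [
        offset
        for offset, lowest in LOWEST_SCORE.items()
        if min(decode(quality_chars, offset).values()) >= lowest
    ]


observed = "*J"
print(decode(observed, 33))         # {'*': 9, 'J': 41}
print(decode(observed, 64))         # {'*': -22, 'J': 10}
print(plausible_offsets(observed))  # [33]
```

With "*" present, offset 64 would give a score of -22, so only offset 33 fits. "J" decodes to 41 there, one above the usual Sanger maximum of 40, which is why the Wikipedia chart marks it as Illumina 1.8+ only; the offset is the same either way.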

As an aside, is there any reason that tools require fastqsanger? I haven't looked at the others, but it seems that at the very least the "FASTQ Splitter" should not care about the encoding of the quality scores. Or am I missing something else?

Thanks!

written 4.1 years ago by pfaucon

Hi, you may have already checked out the link I sent, but if not, here it is again:
http://wiki.galaxyproject.org/Support#FASTQ_Datatype_QA

This describes how to test a dataset, what tools/methods to run to transform or label data, and the like. 

Tools require the .fastqsanger format as a way of normalizing the data before any analysis is started. You wouldn't be able to filter for quality, then use the exact same sequences for splitting (which uses a sequence-matching algorithm) and later for genome mapping, if the data weren't already set to a standardized quality score encoding.
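To make "normalizing" concrete, here is a minimal sketch of re-encoding one quality string from the older Illumina Phred+64 scaling to the Sanger Phred+33 scaling. It is an illustration only, not the code Galaxy runs: the function name `illumina64_to_sanger` is hypothetical, and it assumes the input really is Phred+64 (Solexa-scaled scores would need a different conversion).

```python
def illumina64_to_sanger(quality):
    """Re-encode a Phred+64 quality string as Phred+33 (Sanger).

    Assumes the input is Illumina 1.3+ style Phred+64; this is a hypothetical
    helper for illustration, not part of any Galaxy tool.
    """
    scores = [ord(c) - 64 for c in quality]
    if any(score < 0 for score in scores):
        # A negative score means the input was probably not Phred+64 at all.
        raise ValueError("quality string does not look like Phred+64")
    return "".join(chr(score + 33) for score in scores)


print(illumina64_to_sanger("hh@"))  # II!
```

Running this on data that is already Phred+33 raises an error for low-quality characters such as "*", which is the kind of mix-up that the datatype check is meant to catch early.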

One other benefit of doing data prep at the very start: in my opinion, it is much less tedious to transform a few large datasets first than many, many split datasets later on. The initial QA/QC time is worth it. Workflows can help, of course, as with any pipeline you decide to create in Galaxy, but perhaps this makes sense to you, too.

Command-line usage can vary between tools (some accept different FASTQ formats, including color-space), but input standardization of some kind is quite common, especially on a public server (and not just for bioinformatics tools/applications). Hosting every version of every tool is not practical. So the primary things to focus on at first are a dataset format that meets the specification and consistent content, so that tool input requirements are met. This helps avoid issues with upstream tools and with tools much further downstream. Starting over completely is a frustrating place to be (I know!).

Hope the wiki and video help! Jen, Galaxy team

modified 4.1 years ago • written 4.1 years ago by Jennifer Hillman Jackson