Question: fastq.gz data - help for changes that impact upload (any source), history contents, and tool usage
mvinas20 wrote:

Hello,

I am downloading data from SRA directly into Galaxy, but it seems Galaxy no longer unzips the files automatically. I have done this many times before and it worked, but it does not now. Is there another way to unzip the files?

Thanks, Maria

modified 14 months ago • written 14 months ago by mvinas20
jling60 wrote:

Make sure the 'Datatype' is set to fastq.gz

You can then use 'Convert Format' to go from fastq.gz to fastq. That's how I've been unzipping.
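
For files that are still on a local machine, the same decompression can be done before upload. The following is only a minimal sketch using Python's standard library; the function name `gunzip_file` and the file names are hypothetical and are not part of Galaxy.

```python
import gzip
import os
import shutil
import tempfile


def gunzip_file(src, dest, chunk_size=1024 * 1024):
    """Stream-decompress a gzip file to dest without loading it all into memory."""
    with gzip.open(src, "rb") as compressed, open(dest, "wb") as plain:
        shutil.copyfileobj(compressed, plain, chunk_size)


if __name__ == "__main__":
    # Self-contained demo on a tiny, made-up FASTQ record.
    record = b"@read1\nACGT\n+\nIIII\n"
    with tempfile.TemporaryDirectory() as tmp:
        packed = os.path.join(tmp, "reads.fastq.gz")
        plain = os.path.join(tmp, "reads.fastq")
        with gzip.open(packed, "wb") as handle:
            handle.write(record)
        gunzip_file(packed, plain)
        print(os.path.getsize(plain), "bytes written")
```

Streaming in chunks keeps memory use flat, which matters for multi-gigabyte read files.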

written 14 months ago by jling60

Thanks jling! I added in more details to support your reply, since the change is new and this post can be a resource for sharing full help/advice until the implementation details are finalized.

written 14 months ago by Jennifer Hillman Jackson24k
Jennifer Hillman Jackson24k wrote:

Hello,

Uploaded gz-compressed FASTQ data now loads into the History in compressed format. Tools will now accept compressed datasets as input. This saves space in your account, which is a priority for many users with larger datasets/experiments to analyze. As before, some tools accept fastq datatypes (example: prep/QA steps/tools) and others accept fastqsanger datatypes (example: mapping and downstream analysis steps/tools).

Quick help for using gz compressed input data:

  • If the tool accepts fastq input, then gz compressed data assigned the datatype fastq.gz is appropriate.
  • If the tool accepts fastqsanger input, then gz compressed data assigned the datatype fastqsanger.gz is appropriate.
  • There is still the option to uncompress data and use that with tools, if you wish (as described in jling's informative reply). Usage would be the same as before the change.
  • Very important: Avoid labeling compressed data with an uncompressed datatype, and the reverse.
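
One way to avoid a compressed/uncompressed mismatch is to check the file itself before upload rather than trusting its extension. This is a rough sketch that runs outside Galaxy: it relies on the fact that gzip streams begin with the bytes `1f 8b`, and the helper names `is_gzipped` and `suggest_datatype` are hypothetical, not Galaxy functions.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def is_gzipped(path):
    """Return True if the file starts with the gzip magic number."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


def suggest_datatype(path, base="fastq"):
    """Hypothetical helper: add '.gz' to the base datatype only for gzip files."""
    return base + ".gz" if is_gzipped(path) else base


if __name__ == "__main__":
    # Demo on two tiny, made-up files: one plain, one gzip-compressed.
    record = b"@read1\nACGT\n+\nIIII\n"
    with tempfile.TemporaryDirectory() as tmp:
        plain = os.path.join(tmp, "reads.fastq")
        packed = os.path.join(tmp, "reads_packed.fastq.gz")
        with open(plain, "wb") as handle:
            handle.write(record)
        with gzip.open(packed, "wb") as handle:
            handle.write(record)
        print(suggest_datatype(plain))
        print(suggest_datatype(packed, base="fastqsanger"))
```

The `base` argument still has to be chosen by the user (fastq versus fastqsanger); the sketch only decides whether the `.gz` suffix applies.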

Before assigning fastqsanger or fastqsanger.gz, be sure to confirm the format. Quality values that are not fastqsanger-scaled will cause scientific problems with tools that expect fastqsanger input, even if the tool does not fail. This is the same as how tools have always worked at http://usegalaxy.org. This is how to check format: https://wiki.galaxyproject.org/Support#FASTQ_Datatype_QA
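
As a rough local pre-check (not a replacement for the QA steps linked above), the quality lines can be scanned for their ASCII range. The sketch below assumes simple 4-line FASTQ records and that fastqsanger means Sanger Phred+33 scaling; the function names and the cut-off values (59 and 74) are a common heuristic chosen here for illustration, not anything defined by Galaxy.

```python
def quality_range(lines, max_records=10000):
    """Return (min, max) ASCII codes seen on the quality lines of 4-line records."""
    low, high = None, None
    for index, line in enumerate(lines):
        if index >= 4 * max_records:
            break
        if index % 4 == 3:  # fourth line of each record holds the qualities
            codes = [ord(char) for char in line.rstrip("\r\n")]
            if codes:
                low = min(codes) if low is None else min(low, min(codes))
                high = max(codes) if high is None else max(high, max(codes))
    if low is None:
        raise ValueError("no quality lines found")
    return low, high


def guess_encoding(low, high):
    """Heuristic guess only; a result of 'ambiguous' needs a closer look."""
    if low < 59:
        return "phred33"  # characters this low cannot occur with a +64 offset
    if high > 74:
        return "phred64-like"  # characters this high are unusual for Phred+33
    return "ambiguous"


if __name__ == "__main__":
    # Made-up record; for a real file pass an open text handle instead,
    # e.g. gzip.open(path, "rt") for a .gz file.
    demo = ["@read1\n", "ACGT\n", "+\n", "!I5#\n"]
    low, high = quality_range(demo)
    print(low, high, guess_encoding(low, high))
```

A "phred33" result is consistent with fastqsanger; anything else suggests the data should be checked and possibly converted before that datatype is assigned.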

This "compressed data" change is brand new and may undergo some revisions before it is finalized for the upcoming 17.01 Galaxy release. Details will be documented and learn/tutorial/support help will be updated to reflect the final usage guidelines.

Feedback from our community of users about how this is working in practice (odd behavior, tool failures) is welcome as bug reports if anything remains unclear or tool problems come up. We want to review these, address any problems/tool conflicts that exist, and do our best to provide exact help for your specific data analysis use case. We will also incorporate commonly reported problematic/confusing use cases into our expanded help topics and consider the feedback while tuning the final implementation of this enhancement.

Thanks for the question! Others will hopefully benefit from your description of the problem and the available usage advice (from all) posted back here. Jen, Galaxy team

modified 14 months ago • written 14 months ago by Jennifer Hillman Jackson24k

Update: Published support help for same topic https://galaxyproject.org/support/compressed-fastq/

written 12 months ago by Jennifer Hillman Jackson24k
mvinas20 wrote:

Thank you Jen,

In my case, FASTQ Groomer did not recognize fastq.gz data when it is in a "dataset collection"; it recognizes only "single" fastq.gz datasets.

So I had to uncompress the data, but because it is very large, TopHat is taking more than 2-3 days. Is this normal? Before, it took around 2-3 hours.

Thanks, Maria

written 14 months ago by mvinas20

Hello Maria,

Yes, this is the behavior right now. Final implementation for exactly how to best incorporate/handle compressed data in collections is still in development. Expect progress on this soon as it is a priority task for our team & development contributors.

Jobs are queuing a bit longer than usual right now, and I am guessing that is the 2-3 day wait you are referring to. Allow any jobs in the grey/queued state to remain that way so they retain their place in the server queue (deleting/rerunning will place jobs back at the end of the queue, further extending wait time). Execution time (once the dataset becomes yellow) can vary depending on the size of the inputs (which are processed uncompressed internally by tools), the size of the target genome, whether the genome has pre-computed indexes versus being a Custom genome that requires indexing during processing, the job parameters used, and other factors such as the quality of the NGS reads and the quality/resolution status of the reference genome.

If a job does fail for exceeding wall-time (execution time) or memory resources, then you will need to take some action for a successful job. This may occur during mapping with such large data or later when using downstream tools. Section 2.8 of the Galaxy support wiki explains alternatives for working with data/jobs that exceed the compute resources at http://usegalaxy.org: https://wiki.galaxyproject.org/Support

Hope this helps! Jen

modified 14 months ago • written 14 months ago by Jennifer Hillman Jackson24k

Many thanks Jen. I am now "Grooming" each dataset first and then preparing the "dataset collection", and it worked. I did not have to uncompress the .fastq.gz files, and now the TopHat analysis and next steps are faster than before.

Maria

written 14 months ago by mvinas20