Question: Unzip files on Galaxy
3.0 years ago by
smthorpe10
United States
smthorpe10 wrote:

 

Hello,

I have a fastq.gz file downloaded from SRA that needs to be unzipped. How can I do this on Galaxy?

 

rna-seq galaxy • 5.1k views
Modified 15 months ago by Greg Von Kuster • written 3.0 years ago by smthorpe
3.0 years ago by
United States
Jennifer Hillman Jackson22k wrote:

Hello,

Datasets that are originally in .gz format uncompress automatically when loaded into a history, whether they are uploaded by browsing, by FTP, or through "Get Data -> EBI SRA". This is true on the Main public Galaxy server at http://usegalaxy.org, on Cloudman instances (http://usegalaxy.org/cloud), and by default on local Galaxy instances (http://getgalaxy.org) when FTP is enabled and the rest is left at the basic configuration. It is also true on many Public Galaxy instances (https://wiki.galaxyproject.org/PublicGalaxyServers), unless the administrator has disabled the option. In that case, contacting their support directly is the best way to troubleshoot, but we can help you make contact or clarify details if you need assistance.

UPDATE: tar.gz archives should not come up when getting data from SRA through routine methods, but please be aware that when a tar archive is uploaded, only the first dataset is uncompressed and the rest are discarded, so tar archives are not recommended. Uncompress the archive locally first, then load the datasets individually (recompressing each as .gz will speed up loading).
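For anyone doing that local step by hand, here is a minimal Python sketch of the idea: unpack a tar archive on your own machine and write each file back out as its own .gz, ready to upload one by one. The function name `split_tar_to_gz` and the paths are made up for illustration, and nothing here is part of Galaxy itself.

```python
import gzip
import shutil
import tarfile
from pathlib import Path


def split_tar_to_gz(archive_path, out_dir):
    """Extract each regular file from a tar archive and gzip it separately."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    # Mode "r:*" lets tarfile detect the compression (.tar, .tar.gz, ...)
    with tarfile.open(archive_path, "r:*") as archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue  # skip directories, links, and other special entries
            source = archive.extractfile(member)
            # Keep only the base name so no path stored in the archive is trusted
            target = out_dir / (Path(member.name).name + ".gz")
            with source, gzip.open(target, "wb") as compressed:
                shutil.copyfileobj(source, compressed)
            written.append(target)
    return written
```

Calling `split_tar_to_gz("reads.tar.gz", "for_upload")` would leave one .gz file per archive member in `for_upload`. Files with the same base name in different archive folders would overwrite each other in this sketch.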

I suspect that you are looking at the dataset name (which will often still have .gz appended), but what you want to examine is the "datatype" metadata. You can see this in the UI in the dataset's box, where it is clearly labeled. For new uploads, if the format was left on "autodetect", the "datatype" will most likely be simply ".fastq".
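The same point (a name ending in .gz does not prove the content is compressed) can be checked on a local copy of a file. This is only a sketch in Python, with `is_gzip_compressed` as a made-up helper name: gzip data always starts with the two bytes 0x1f 0x8b, so reading those two bytes tells you more than the file name does.

```python
GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of every gzip stream


def is_gzip_compressed(path):
    """Return True if the file content starts with the gzip magic bytes."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC
```

A file that returns False here is already plain text, whatever its name says.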

The next step for working with this data in Galaxy is to determine the correct quality score scaling and assign a more specific datatype that describes it (rescaling with the tool "FASTQ Groomer" if needed). The wiki section linked below has instructions, including a screencast example. Getting this right at the very start of an analysis is essential for valid results.
https://wiki.galaxyproject.org/Support#FASTQ_Datatype_QA
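To give a feel for what "determining the quality scaling" means, here is a rough Python heuristic based on the lowest ASCII character seen in the quality lines. It is a sketch, not how Galaxy or FASTQ Groomer decides: the name `guess_quality_offset` is invented, the thresholds assume the usual ranges (Phred+33 starts at "!", Phred+64 at "@", Solexa+64 can go down to ";"), and a small sample of very high quality reads can fool it. FastQC remains the better check.

```python
def guess_quality_offset(quality_lines):
    """Guess the ASCII offset (33 or 64) from FASTQ quality strings.

    Returns None when there is no data or the range is ambiguous.
    """
    codes = [ord(ch) for line in quality_lines for ch in line.strip()]
    if not codes:
        return None
    lowest = min(codes)
    if lowest < 59:
        return 33  # characters this low only occur with an offset of 33
    if lowest >= 64:
        return 64  # likely offset 64, assuming the sample includes low scores
    return None  # 59..63: could be Solexa scaling or offset 33, cannot tell
```

Pass it the fourth line of each record from a few thousand reads; the more reads sampled, the less likely the guess is wrong.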

The accession from the example above can be used with the tool "Get Data -> EBI SRA" as a search to locate the exact full dataset. As is the case with the majority, if not all, of the SRA data obtained this way, there are two options for import into Galaxy. The table for an experiment will have two import options: "submitted", which is the original author data (raw), and "processed", which has been manipulated by SRA to scale the quality scores to Fastq Sanger Phred with an ASCII offset of +33, otherwise known as the datatype label ".fastqsanger" in Galaxy. Use the "processed" option to move forward quickly with analysis. Run FastQC to confirm (it is also highly recommended for other basic metrics, e.g. QC for content). Most tools require this "datatype" assignment as a starting point for correct data interpretation, and some will not even recognize a .fastq dataset as valid input without it.
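As a small worked example of what the +33 offset means, the Python sketch below decodes a Sanger quality string into Phred scores, turns a score into an error probability, and shifts a Phred+64 string down to Phred+33. The function names are invented for illustration, this is not FASTQ Groomer's code, and the simple shift is not valid for old Solexa-scaled data, which needs a non-linear conversion.

```python
def phred33_scores(quality_string):
    """Decode a Sanger (Phred+33) quality string into integer scores."""
    return [ord(ch) - 33 for ch in quality_string]


def error_probability(score):
    """Probability that a base call with this Phred score is wrong."""
    return 10 ** (-score / 10)


def rescale_phred64_to_phred33(quality_string):
    """Shift a Phred+64 quality string to Phred+33 (same scores, new offset)."""
    return "".join(chr(ord(ch) - 31) for ch in quality_string)
```

For instance, the character "I" has ASCII code 73, so it decodes to a score of 40 under the +33 offset, which corresponds to an error probability of about 1 in 10,000.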

I provided a bit more information than you asked for, but it is all interrelated, and overlooking these early QA steps is a quite common cause of issues downstream. So, I tend to share whenever I can! Please make use of what is helpful for your case, and hopefully it will aid others reading, whether they are in a similar situation or just following new posts.

If you continue to have what you suspect are uncompression problems with loaded data, please explain a bit more about where you are working and the exact steps you have done so far. Include the metadata assignments for the dataset. We can work from there to resolve any outstanding issues.

Cheers! Jen, Galaxy team

Modified 3.0 years ago • written 3.0 years ago by Jennifer Hillman Jackson

I am unable to retrieve my account on Galaxy. It says that the account has been active since 6/19/14, and I have the correct username and password, but it also says the account does not exist. Should I make a new one with my other email address, or can I get technical support to fix the account I've already made? I had actually started a pipeline of data on that account and would like to save it if possible.

Reply written 3.0 years ago by smthorpe

Hi Jennifer!

Thank you very much for your help. I need a clarification. If I upload fastq.gz files (Illumina 1.9 encoded) and need to convert them to .fastqsanger, what should I do? Should I just assign the datatype .fastqsanger, or do I then need to convert them using Convert in Edit Attributes? As far as I can tell, the latter option is more time-consuming.

Reply written 4 months ago by lamteva.vera

Hello,

An uploaded fastq.gz dataset that is known to be Illumina 1.9 encoded can be directly assigned the datatype fastqsanger.gz. This can be specified in the Upload tool during upload, or it can be assigned after upload by clicking on the pencil icon for the dataset and changing the type (Edit Attributes > Datatype).

There is no need to convert to another format or uncompress the data. Tools can use the compressed fastq data as input. Or have you discovered one (or some) that will not? The change is new so this is possible.
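The same holds outside Galaxy: most scripting languages can read gzipped FASTQ directly, so there is rarely a reason to uncompress just to inspect a file. As a minimal Python sketch (the name `count_fastq_records` is invented, and it assumes the usual four lines per read with no wrapped sequences):

```python
import gzip


def count_fastq_records(path):
    """Count reads in a gzip-compressed FASTQ file without unpacking it to disk."""
    # "rt" opens the compressed stream as text and decompresses on the fly
    with gzip.open(path, "rt") as handle:
        line_count = sum(1 for _ in handle)
    return line_count // 4  # assumes exactly four lines per record
```

Running it on a local copy of the upload gives a quick read count to compare against what the history shows after loading.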

Jen, Galaxy team

Reply written 4 months ago by Jennifer Hillman Jackson
15 months ago by
Penn State University
Greg Von Kuster810 wrote:

It may be the case that your input dataset for the bowtie tool has the datatype fastq, while bowtie requires fastqsanger. I'm not sure if Galaxy version 1.1.2 of the tool allows for more datatypes, but Galaxy version 1.1.4 allows any one of fastqsanger, fastqillumina, or fastqsolexa.

Written 15 months ago by Greg Von Kuster
15 months ago by
mathew.mano0 wrote:

Hi, I need your suggestion regarding Mapping with Bowtie for Illumina (Galaxy Version 1.1.2). I used the EBI SRA method to upload my data, which was in fastq.gz format. The quality check worked fine, but when I tried to do the mapping with Bowtie for Illumina I was unable to select the dataset in the "FASTQ file" option. The data is single-end as well. Please guide me or suggest a way to perform the alignment.

Thank you for your suggestion in advance

Written 15 months ago by mathew.mano

This application may require you to change from fastq format to a SAM or BAM file for use in Bowtie.

Reply written 15 months ago by smthorpe

The specific flavor of fastq must be specified (assigned as a datatype). See Greg Von Kuster's answer below. If using Galaxy Main, this tool's version is at: Map with BWA for Illumina (Galaxy Version 1.2.3)

Either fastqsanger or fastqillumina is required. This wiki (and the linked video) show how to determine and assign the type: https://wiki.galaxyproject.org/Support#Dataset_special_cases

Most tools still require fastqsanger. So converting to that (if needed) is probably the best path to avoid problems with downstream tools.

Reply written 15 months ago by Jennifer Hillman Jackson

Thank you Jennifer and smthorpe for your suggestions. I could proceed successfully. Have a good day ahead.

Reply written 15 months ago by mathew.mano