Unzip files on Galaxy

4.4 years ago by

United States

Jennifer Hillman Jackson ♦ 25k wrote:

Hello,

Datasets that are originally in .gz format and are uploaded by browsing, by FTP, or through "Get Data -> EBI SRA" will uncompress automatically when loaded into a history. This is true on the Main public Galaxy server at http://usegalaxy.org, on Cloudman instances (http://usegalaxy.org/cloud), and by default on local Galaxy instances (http://getgalaxy.org, when FTP is enabled with the rest at basic configuration). It is also true on many Public Galaxy instances (https://wiki.galaxyproject.org/PublicGalaxyServers), unless the administrator has disabled the option. In that case, direct contact with their support would be the best way to troubleshoot, but let us know and we can help you make contact or clarify details if you need assistance.

UPDATE: tar.gz should not be an issue from SRA through routine methods, but please know that with tar archives only the first dataset is uncompressed and the rest are discarded, so they are not recommended. Uncompress the archive locally first, then load the datasets individually (recompressing each as .gz will speed loading).
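The local repacking step described above could be scripted along these lines. This is only a sketch under assumptions: the function name `split_tar_to_gz` and the paths are hypothetical, the files inside the archive are assumed to be uncompressed, and files that share a base name would overwrite each other.

```python
import gzip
import shutil
import tarfile
from pathlib import Path


def split_tar_to_gz(archive_path, out_dir):
    """Write each regular file inside a tar archive as its own .gz file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    # Mode "r:*" lets tarfile detect plain .tar, .tar.gz, or .tar.bz2 itself.
    with tarfile.open(archive_path, "r:*") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links, and other special entries
            # Keep only the base name so nothing is written outside out_dir.
            target = out_dir / (Path(member.name).name + ".gz")
            source = archive.extractfile(member)
            with source, gzip.open(target, "wb") as compressed:
                shutil.copyfileobj(source, compressed)
            written.append(target)
    return written


if __name__ == "__main__":
    import sys

    # Usage: python split_tar.py reads.tar.gz output_directory
    if len(sys.argv) == 3:
        for path in split_tar_to_gz(sys.argv[1], sys.argv[2]):
            print(path)
```

Each resulting .gz file can then be uploaded as a separate dataset.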

I suspect that you are looking at the dataset name (which will often still have .gz appended), but what you want to examine is the "datatype" metadata. You can see this in the UI in the dataset's box, where it is clearly labeled. For new uploads where the format was left on "autodetect", the "datatype" will most likely simply be ".fastq".
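For a file on your own computer, the same point (the name is not the content) can be checked directly. A minimal sketch, assuming a hypothetical helper `is_gzip_file`: gzip data always starts with the two bytes 0x1f 0x8b, so a file that is named .gz but lacks them is not actually compressed.

```python
GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def is_gzip_file(path):
    """Return True if the file content starts with the gzip magic bytes."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


if __name__ == "__main__":
    import sys

    # Usage: python check_gzip.py file1 [file2 ...]
    for name in sys.argv[1:]:
        state = "gzip-compressed" if is_gzip_file(name) else "not gzip-compressed"
        print(f"{name}: {state}")
```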

To work with this data further in Galaxy, the next step is to determine the correct quality score scaling and assign a more specific datatype that describes it (rescaling with the tool "FASTQ Groomer" if needed). This wiki section has instructions, including a screencast example. This is incredibly important (essential!) to get correct at the very start of an analysis run for valid results.
https://wiki.galaxyproject.org/Support#FASTQ_Datatype_QA
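As a rough illustration of what "determining the quality scaling" means, the range of characters in the quality lines can be inspected. The sketch below is a common heuristic, not the method used by FASTQ Groomer or FastQC, and the function name `guess_fastq_datatype` is hypothetical. It assumes Sanger/Illumina 1.8+ scores use ASCII offset 33 and older Illumina/Solexa scores use offset 64, and it returns None when the characters seen fit both.

```python
def guess_fastq_datatype(quality_lines):
    """Guess a Galaxy FASTQ datatype label from quality strings (heuristic)."""
    codes = [ord(ch) for line in quality_lines for ch in line.strip()]
    if not codes:
        raise ValueError("no quality characters supplied")
    lowest, highest = min(codes), max(codes)
    if lowest < 59:
        # Characters below ';' cannot occur with a +64 offset.
        return "fastqsanger"
    if highest <= 74:
        # Could be high-quality +33 data or low-quality +64 data.
        return None
    if lowest < 64:
        # ';' through '?' encode negative Solexa scores with a +64 offset.
        return "fastqsolexa"
    return "fastqillumina"


print(guess_fastq_datatype(["IIII", "!!#5"]))  # fastqsanger
print(guess_fastq_datatype(["hhhh", "BBBh"]))  # fastqillumina
print(guess_fastq_datatype(["IIII"]))          # None (ambiguous)
```

Checking more reads makes the guess more reliable, and the result should still be confirmed with FastQC as described in the wiki.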

The accession from the example above can be used as a search term in the tool "Get Data -> EBI SRA" to locate the exact full dataset. As is the case with the majority, if not all, of the SRA data obtained this way, there are two options for import into Galaxy. The table for an experiment will have two import options: "submitted", which is the original author data (raw), and "processed", which has undergone manipulations by SRA to scale the quality scores to Fastq Sanger Phred with an ASCII offset of +33, otherwise known as the datatype label ".fastqsanger" in Galaxy. Use "processed" to move forward quickly with analysis. Run FastQC to confirm (also highly recommended for other basic metrics, e.g. QC for content). Most tools require this "datatype" assignment as a starting point for correct data interpretation, and some will not even recognize a .fastq dataset as valid input without it.
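To make the Sanger scaling concrete: each quality character encodes a Phred score as its ASCII code minus 33, and a Phred score Q corresponds to an error probability of 10^(-Q/10). A minimal sketch with hypothetical function names:

```python
SANGER_OFFSET = 33  # ASCII offset used by fastqsanger


def phred33_scores(quality_string):
    """Convert a Sanger-encoded quality string to a list of Phred scores."""
    return [ord(ch) - SANGER_OFFSET for ch in quality_string]


def error_probability(score):
    """Probability that a base call with this Phred score is wrong."""
    return 10 ** (-score / 10)


print(phred33_scores("!5I"))  # [0, 20, 40]
print(error_probability(20))
```

Reading the same characters with the wrong offset would shift every score by 31, which is why tools need the datatype to be assigned correctly.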

I provided a bit more information than you asked for, but it is all interrelated, and overlooking these early QA steps is a quite common cause of issues downstream. So, I tend to share whenever I can! Please make use of what is helpful for your case, and hopefully it will aid others reading, whether in a similar situation or just following new posts.

If you continue to have what you suspect are uncompression problems on upload, please explain a bit more about where you are working and the exact steps you have done so far. Include the metadata assignments for the dataset. We can work from there to resolve any outstanding issues.

Cheers! Jen, Galaxy team

modified 4.4 years ago • written 4.4 years ago by Jennifer Hillman Jackson ♦ 25k

I am unable to retrieve my account on Galaxy, and it says that it has been active since 6/19/14. I have the correct username and password, but it says the account does not exist. Should I make a new one with my other email address, or can I get technical support to fix the account I've already made? I had actually started a pipeline of data on the account, and I would like to save it if possible.

written 4.4 years ago by smthorpe • 10

Hi Jennifer!

Thank you very much for your help. I need clarification. If I upload fastq.gz files (Illumina 1.9 encoded) and need to convert them to .fastqsanger, what should I do? Just assign the datatype .fastqsanger? Or do I then need to convert them by using Convert in Edit attributes? As far as I could tell, the latter option is more time-consuming.

written 21 months ago by lamteva.vera • 0

Hello, An uploaded fastq.gz dataset that is known to be Illumina 1.9 encoded can be directly assigned the datatype fastqsanger.gz. This can be specified in the Upload tool during upload, or it can be assigned after upload by clicking on the pencil icon for the dataset and changing the type (Edit Attributes > Datatype).

There is no need to convert to another format or uncompress the data. Tools can use the compressed fastq data as input. Or have you discovered one (or some) that will not? The change is new so this is possible.
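The same idea applies to scripts outside Galaxy: gzip-compressed FASTQ can be read as a stream without first writing an uncompressed copy. A small sketch, assuming the usual four lines per record (no wrapped sequence lines) and a hypothetical function name:

```python
import gzip


def count_fastq_records(path):
    """Count records in a gzip-compressed FASTQ file (4 lines per record)."""
    lines = 0
    # Mode "rt" decompresses on the fly and yields text lines.
    with gzip.open(path, "rt") as handle:
        for _ in handle:
            lines += 1
    if lines % 4 != 0:
        raise ValueError(f"{path}: {lines} lines is not a multiple of 4")
    return lines // 4


if __name__ == "__main__":
    import sys

    # Usage: python count_reads.py reads.fastq.gz
    if len(sys.argv) == 2:
        print(count_fastq_records(sys.argv[1]))
```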

Jen, Galaxy team

written 21 months ago by Jennifer Hillman Jackson ♦ 25k

2.7 years ago by

mathew.mano • 0 wrote:

Hi, I need your suggestion regarding Mapping with Bowtie for Illumina (Galaxy Version 1.1.2). I used the EBI SRA method to upload my data, which were in fastq.gz format. I performed the quality check and it worked fine, but when I tried to do the mapping with Bowtie for Illumina I was unable to select the dataset in the FASTQ file option. The data is single-ended as well. Please kindly guide me or suggest a way to perform the alignment.

Thank you for your suggestion in advance

written 2.7 years ago by mathew.mano • 0

This application may require you to change from fastq format to a SAM or BAM file for use in Bowtie.

written 2.7 years ago by smthorpe • 10

The specific flavor of fastq must be specified (assigned as a datatype). See Greg Von Kuster's answer below. If using Galaxy Main, this tool's version is at: Map with BWA for Illumina (Galaxy Version 1.2.3)

Either fastqsanger or fastqillumina is required. This wiki (and the linked video) show how to determine and assign the type: https://wiki.galaxyproject.org/Support#Dataset_special_cases

Most tools still require fastqsanger. So converting to that (if needed) is probably the best path to avoid problems with downstream tools.

written 2.7 years ago by Jennifer Hillman Jackson ♦ 25k

Thank you Jennifer and smthorpe for your suggestions. I could proceed successfully. Have a good day ahead.

written 2.7 years ago by mathew.mano • 0
