Question: Workflow Failure Issue with .dat Compressed Files on a Local Galaxy Instance
othman.soufan wrote (3 months ago):

Greetings,

We have developed a workflow on a local Galaxy server instance that includes the following steps:

  1. Input dataset collection
  2. Trim Galore (trimming)
  3. Tophat (alignment) ...

The workflow starts running fine, and when step 2 finishes, the output files generated by Trim Galore are stored as .dat files on the Galaxy instance (Galaxy uses the .dat extension for every file it stores on disk). In this case the ".dat" file is compressed, and when Tophat starts running it fails because it cannot recognize that this is really a ".gz" file, which it can handle.

Tophat tool execution generates the following messages:

[2018-07-31 15:14:47] Beginning TopHat run (v2.1.0)
-----------------------------------------------
[2018-07-31 15:14:47] Checking for Bowtie
          Bowtie version:    2.2.5.0
[2018-07-31 15:14:47] Checking for Bowtie index files (genome)..
[2018-07-31 15:14:47] Checking for reference FASTA file
[2018-07-31 15:14:47] Generating SAM header for /home/bioinfotools/galaxy/tool-data/cJaponicaGenome/bowtie2_index/cJaponicaGenome/cJaponicaGenome
**Error: cannot determine record type in input file /home/bioinfotools/galaxy/database/files/001/dataset_1100.dat**

Internally, if the .dat files are decompressed, or if the extension is simply changed to .gz, Tophat works fine. However, what is the best practice for fixing this issue?

I came across this post, which requires certain changes to the source code. While this is a possible workaround, it may interfere with future updates of the local instance.

Please let me know if there is any other way to overcome this issue.

Regards,

Othman

modified 3 months ago • written 3 months ago by othman.soufan

Hi Jennifer,

Please see my responses below:

  • Which Galaxy release are you running? 18.05 is the most current and strongly recommended.

    Using "ourip:8080/api/version", I get: version_major: "18.05". So this is based on a checkout of the latest release of the Galaxy server.

  • What was the source? Local via Github or a Docker image? The URL would be helpful.

    Github using: git fetch origin && git checkout release_18.05 && git pull --ff-only origin release_18.05

  • Are the wrapped tools included in the workflow all from the Galaxy Main ToolShed (MTS) https://usegalaxy.org/toolshed?

    Yes

  • Are these wrappers the most current version in the MTS? (will work best)

    I believe so

  • Did you check under Admin > Tool Management (in the GUI) to make sure that all dependencies are installed?

    Yes, all dependencies are installed, and outside of a workflow all components run fine.

  • Do the tools run directly from the history correctly? (outside of a workflow)

    Yes (see the previous answer)

  • Does this workflow execute successfully at one of the Public UseGalaxy* servers? Try a sample of the data that is still representative of the whole. If you test at Galaxy Main https://usegalaxy.org and run into a problem, I'll be able to help you to review the inputs or any other usage issues that may come up in more detail. How to interpret/report a problem: https://galaxyproject.org/support/tool-error/

    Yes, the same problem persists. If the files are decompressed during upload, the pipeline runs fine. However, the problem exists with compressed .gz files, as Tophat seems unable to determine the file type. An example of the commands generated in the workflow:

The following error is generated on the server:

Fatal error: Tool execution failed
Building a SMALL index

[2018-08-01 21:29:13] Beginning TopHat run (v2.1.1)
[2018-08-01 21:29:13] Checking for Bowtie
          Bowtie version:    2.2.8.0
[2018-08-01 21:29:13] Checking for Bowtie index files (genome)..
[2018-08-01 21:29:13] Checking for reference FASTA file
[2018-08-01 21:29:13] Generating SAM header for genome
Error: cannot determine record type in input file /galaxy-repl/main/files/026/459/dataset_26459197.dat

The following command was executed during the workflow run in the local instance:

tophat2  --num-threads ${GALAXY_SLOTS:-4}  --read-mismatches 2  --read-edit-dist 2 --read-realign-edit-dist 1000 -a 8 -m 0 -i 70 -I 500000 -g 20 --min-segment-intron 50 --max-segment-intron 500000 --segment-mismatches 2 --segment-length 25 --library-type fr-unstranded  --max-insertion-length 3 --max-deletion-length 3  -G /home/data/EcoTox/Meta/Cjaponica/GCF_001577835_1_Coturnix_japonica_2_0_genomic_nogeneid_removed.gff  --no-coverage-search       -r 300 --mate-std-dev=20   /home/bioinfotools/galaxy/tool-data/cJaponicaGenome/bowtie2_index/cJaponicaGenome/cJaponicaGenome "/home/bioinfotools/galaxy/database/files/001/dataset_1355.dat" "/home/bioinfotools/galaxy/database/files/001/dataset_1356.dat"

As can be seen, the .dat extension is used, which confuses tophat2. If I only change the extension to .gz, the command runs fine.
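For anyone who wants to try the renaming workaround without touching the files Galaxy has stored, one option is to symlink each dataset under a name that carries the real extension and pass the link to the tool. The following is only a minimal Python sketch, not anything taken from Galaxy or the Tophat wrapper; the function name `link_with_extension`, the default `.fastq.gz` suffix, and the working-directory layout are all hypothetical.

```python
import os
import tempfile


def link_with_extension(dataset_path, workdir, extension=".fastq.gz"):
    """Create a symlink in workdir that points at dataset_path.

    The link name reuses the dataset's base name but ends in `extension`,
    so the stored .dat file itself is never renamed or modified.
    """
    base = os.path.splitext(os.path.basename(dataset_path))[0]
    link_path = os.path.join(workdir, base + extension)
    os.symlink(os.path.abspath(dataset_path), link_path)
    return link_path


if __name__ == "__main__":
    # Throwaway file standing in for a Galaxy dataset (hypothetical name).
    with tempfile.TemporaryDirectory() as tmp:
        dat = os.path.join(tmp, "dataset_1355.dat")
        with open(dat, "wb") as handle:
            handle.write(b"placeholder")
        workdir = os.path.join(tmp, "work")
        os.mkdir(workdir)
        print(link_with_extension(dat, workdir))
```

This only changes the name the tool sees; it does not address which datatype Galaxy has assigned to the dataset.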

Please let me know your advice for resolving the issue.

written 3 months ago by othman.soufan

Thanks for sending in the bug report. Could you also send in a share link to the workflow you used to generate the data? The problem may be in there if the tools work directly from the history.

Please generate the workflow share link and send it as a reply to your bug report or send it directly to galaxy-bugs@lists.galaxyproject.org using the same email address as you use for your galaxy account/sending a bug report.

modified 3 months ago • written 3 months ago by Jennifer Hillman Jackson

No extra info needed - I found the problem. :)

The input fastq datasets need to be assigned the datatype fastqsanger or fastqsanger.gz, depending on whether they are actually compressed. The datatype fastq.gz does not create recognizable inputs for this tool, or for many others that accept fastq inputs. Trim Galore!, FastQC, and other data preparation tools are usually the exceptions, so this type of problem tends to show up at the first true analysis step (usually mapping).
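If it is unclear whether a given file is really compressed, the first two bytes settle it regardless of the file name: every gzip stream starts with 0x1f 0x8b. Here is a small Python sketch of that check. The helper names `is_gzip` and `suggest_fastq_datatype` are made up for illustration, and the sketch assumes the reads are already Sanger-scaled; it only looks at compression, not at the quality encoding.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream (RFC 1952)


def is_gzip(path):
    """Return True if the file starts with the gzip magic number."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


def suggest_fastq_datatype(path):
    """Suggest a Galaxy datatype name based on compression only.

    Assumption: the reads use Sanger (Phred+33) quality scores.
    """
    return "fastqsanger.gz" if is_gzip(path) else "fastqsanger"


if __name__ == "__main__":
    # A gzip-compressed record stored under a .dat name, as on a Galaxy disk.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "dataset_1100.dat")
        with gzip.open(path, "wb") as handle:
            handle.write(b"@read1\nACGT\n+\nIIII\n")
        print(suggest_fastq_datatype(path))
```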

The Support FAQs here explain more about fastq data in Galaxy: https://galaxyproject.org/support/#getting-inputs-right

And the Galaxy Tutorials here include examples of manipulating fastq data: https://galaxyproject.org/learn/ (all, including general) and http://galaxyproject.github.io/training-material/topics/galaxy-data-manipulation/ (advanced collection methods).

modified 3 months ago • written 3 months ago by Jennifer Hillman Jackson

Hi Jennifer,

Thanks so much for your follow-ups and replies. On the usegalaxy server, the input files did indeed have the datatype fastq.gz. However, on the local instance where this was failing, the assigned datatype was fastqsanger.gz.

When I build the collection, the .fastq extension is kept in the label name of the file pair. So I have just changed the extension of the input data file on our server from .fastq.gz to .fastqsanger.gz and am executing a run at the moment.

On the usegalaxy server, I have started a run in which I make sure the assigned datatype is fastqsanger.gz, and I will give you an update afterwards.

Somehow, I still believe this is happening because the underlying tophat command is executed on files with the ".dat" extension without being told that they are just compressed files.

written 3 months ago by othman.soufan

Just in case, I have also replied to the bug report email with the share link for the workflow.

written 3 months ago by othman.soufan

Do you have any update on this issue, please?

written 3 months ago by othman.soufan

Jennifer Hillman Jackson (United States) wrote (3 months ago):

Hello,

That post is much older (7.5 years ago) and the Galaxy architecture has matured since then.

Galaxy tool wrappers do not rely on file names (as displayed in the web GUI or as stored on disk) to determine file formats; they rely on the assigned datatype metadata attribute.
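To make the idea concrete, here is a toy Python sketch of choosing how to read a file from a datatype label rather than from its name. This is not Galaxy's code; `open_reads` and `count_fastq_records` are hypothetical helpers, and the sketch assumes plain four-line FASTQ records.

```python
import gzip
import os
import tempfile


def open_reads(path, datatype):
    """Open a dataset as text; the datatype label, not the name, decides how."""
    if datatype.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def count_fastq_records(path, datatype):
    """Count records, assuming exactly four lines per FASTQ record."""
    with open_reads(path, datatype) as handle:
        return sum(1 for _ in handle) // 4


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Compressed content under a .dat name, as stored on a Galaxy disk.
        path = os.path.join(tmp, "dataset_1355.dat")
        with gzip.open(path, "wt") as handle:
            handle.write("@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n")
        print(count_fastq_records(path, "fastqsanger.gz"))
```

With the wrong label (for example, treating the same compressed file as uncompressed fastqsanger), the reader would try to interpret compressed bytes as text, which is the same kind of mismatch that surfaces as "cannot determine record type".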

The starting fastq inputs should be given the datatype fastqsanger or fastqsanger.gz, during Upload or afterwards, depending on whether they are compressed (see https://galaxyproject.org/support/#getting-inputs-right). It sounds like the data are compressed, so changing the datatype to fastqsanger.gz and then creating the input dataset collection(s) will likely solve the problem. Alternatively, you can re-Upload, set the datatype, and create the data collection(s) all at once; for how-to, see http://galaxyproject.github.io/training-material/topics/galaxy-data-manipulation/

Resources:

If changing the datatype does not resolve the problem, a few questions to help troubleshoot more:

  • Which Galaxy release are you running? 18.05 is the most current and strongly recommended.
  • What was the source? Local via Github or a Docker image? The URL would be helpful.
  • Are the wrapped tools included in the workflow all from the Galaxy Main ToolShed (MTS) https://usegalaxy.org/toolshed?
  • Are these wrappers the most current version in the MTS? (will work best)
  • Did you check under Admin > Tool Management (in the GUI) to make sure that all dependencies are installed?
  • Do the tools run directly from the history correctly? (outside of a workflow)
  • Does this workflow execute successfully at one of the Public UseGalaxy* servers? Try a sample of the data that is still representative of the whole. If you test at Galaxy Main https://usegalaxy.org and run into a problem, I'll be able to help you to review the inputs or any other usage issues that may come up in more detail. How to interpret/report a problem: https://galaxyproject.org/support/tool-error/

Please try adjusting the datatype first as that is probably the root of the issue. Thanks! Jen, Galaxy team

modified 3 months ago • written 3 months ago by Jennifer Hillman Jackson