Question: Workflow Failure Issue with .dat Compressed Files on a Local Galaxy Instance

3 months ago by

othman.soufan • 0 wrote:

Greetings,

We have developed a workflow on a local Galaxy server instance that includes the following steps:

1. Input dataset collection
2. Trim Galore (trimming)
3. Tophat (alignment) ...

The workflow starts running fine. When step 2 finishes, the output files generated by Trim Galore are stored as .dat files on the Galaxy instance, since Galaxy uses the .dat extension for every file it stores on disk. In this case the ".dat" file is compressed, and when Tophat starts running it fails because it cannot recognize that the file is just a ".gz" file, which it can handle.

Tophat tool execution generates the following messages:

[2018-07-31 15:14:47] Beginning TopHat run (v2.1.0)
-----------------------------------------------
[2018-07-31 15:14:47] Checking for Bowtie
          Bowtie version:    2.2.5.0
[2018-07-31 15:14:47] Checking for Bowtie index files (genome)..
[2018-07-31 15:14:47] Checking for reference FASTA file
[2018-07-31 15:14:47] Generating SAM header for /home/bioinfotools/galaxy/tool-data/cJaponicaGenome/bowtie2_index/cJaponicaGenome/cJaponicaGenome
**Error: cannot determine record type in input file /home/bioinfotools/galaxy/database/files/001/dataset_1100.dat**

If the .dat files are decompressed, or if the extension is simply changed to .gz, Tophat works fine. What is the best practice for fixing this issue?
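One way to confirm what a Galaxy .dat file really contains is to look at its first two bytes rather than its name: every gzip stream starts with the magic bytes 0x1f 0x8b. The following is a minimal Python sketch of that check. The helper names `is_gzip` and `first_fastq_header` are illustrative only and are not part of Galaxy or TopHat.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def is_gzip(path):
    """Return True if the file starts with the gzip magic bytes."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


def first_fastq_header(path):
    """Return the first line of the file, decompressing it when needed."""
    opener = gzip.open if is_gzip(path) else open
    with opener(path, "rt") as handle:
        return handle.readline().rstrip("\n")
```

Calling `is_gzip` on one of the dataset_*.dat paths from the failing job would show whether the file is compressed regardless of its extension.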

I came across this post, which requires certain changes to the source code. While this is a possible workaround, it may affect future updates of the local instance.

Please let me know if there is any other way to overcome this issue.

Regards,

Othman

fastq fastqsanger.gz galaxy datatype fastqsanger • 324 views

modified 3 months ago • written 3 months ago by othman.soufan • 0

Hi Jennifer,

Please see my responses below:

Which Galaxy release are you running? 18.05 is the most current and strongly recommended.

Using "ourip:8080/api/version", I get: version_major: "18.05". So this is based on checking out the latest version of the Galaxy server.

What was the source? Local via Github or a Docker image? The URL would be helpful.

Github using: git fetch origin && git checkout release_18.05 && git pull --ff-only origin release_18.05

Are the wrapped tools included in the workflow all from the Galaxy Main ToolShed (MTS) https://usegalaxy.org/toolshed?

Yes

Are these wrappers the most current version in the MTS? (will work best)

I believe so

Did you check under Admin > Tool Management (in the GUI) to make sure that all dependencies are installed?

Yes, all dependencies are installed, and outside of a workflow all components run fine.

Do the tools run directly from the history correctly? (outside of a workflow)

Yes (see the previous answer)

Does this workflow execute successfully at one of the Public UseGalaxy* servers? Try a sample of the data that is still representative of the whole. If you test at Galaxy Main https://usegalaxy.org and run into a problem, I'll be able to help you to review the inputs or any other usage issues that may come up in more detail. How to interpret/report a problem: https://galaxyproject.org/support/tool-error/

Yes, the same problem persists. If the files are decompressed on upload, the pipeline runs fine. However, the problem exists with compressed .gz files, as Tophat seems unable to determine the file type. An example of the error and of the generated command follows:

The following error is generated on the server: Fatal error: Tool execution failed Building a SMALL index

[2018-08-01 21:29:13] Beginning TopHat run (v2.1.1)
[2018-08-01 21:29:13] Checking for Bowtie
Bowtie version: 2.2.8.0
[2018-08-01 21:29:13] Checking for Bowtie index files (genome)..
[2018-08-01 21:29:13] Checking for reference FASTA file
[2018-08-01 21:29:13] Generating SAM header for genome
Error: cannot determine record type in input file /galaxy-repl/main/files/026/459/dataset_26459197.dat

The following command was executed during the workflow run in the local instance:

tophat2  --num-threads ${GALAXY_SLOTS:-4}  --read-mismatches 2  --read-edit-dist 2 --read-realign-edit-dist 1000 -a 8 -m 0 -i 70 -I 500000 -g 20 --min-segment-intron 50 --max-segment-intron 500000 --segment-mismatches 2 --segment-length 25 --library-type fr-unstranded  --max-insertion-length 3 --max-deletion-length 3  -G /home/data/EcoTox/Meta/Cjaponica/GCF_001577835_1_Coturnix_japonica_2_0_genomic_nogeneid_removed.gff  --no-coverage-search       -r 300 --mate-std-dev=20   /home/bioinfotools/galaxy/tool-data/cJaponicaGenome/bowtie2_index/cJaponicaGenome/cJaponicaGenome "/home/bioinfotools/galaxy/database/files/001/dataset_1355.dat" "/home/bioinfotools/galaxy/database/files/001/dataset_1356.dat"

As can be seen, the .dat extension is used, which confuses tophat2. If I change only the extension to .gz, the command runs fine.
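The rename test can be repeated without touching the files Galaxy has stored by pointing a symbolic link with a .gz suffix at each .dat file and passing the links to tophat2 by hand. The sketch below assumes, as the behaviour described here suggests, that the file name is what matters to the tool. `link_with_suffix` is a hypothetical helper, and this is a diagnostic aid rather than a fix for the workflow.

```python
import os


def link_with_suffix(dat_path, work_dir, suffix=".fastq.gz"):
    """Create a symlink in work_dir that points at dat_path but ends in suffix."""
    base = os.path.splitext(os.path.basename(dat_path))[0]
    link_path = os.path.join(work_dir, base + suffix)
    # Link to the absolute path so the link works from any working directory.
    os.symlink(os.path.abspath(dat_path), link_path)
    return link_path
```

For example, linking dataset_1355.dat into a scratch directory yields a path ending in dataset_1355.fastq.gz that can replace the quoted .dat argument in a manual run of the command above.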

Please let me know your advice for resolving the issue.

written 3 months ago by othman.soufan • 0

Thanks for sending in the bug report. Could you also send in a share link to the workflow you used to generate the data? The problem may be in there if the tools work directly from the history.

Please generate the workflow share link and send it as a reply to your bug report, or send it directly to galaxy-bugs@lists.galaxyproject.org from the same email address that you use for your Galaxy account and for sending bug reports.

modified 3 months ago • written 3 months ago by Jennifer Hillman Jackson ♦ 25k

No extra info needed - I found the problem. :)

The input fastq datasets need to be assigned the datatype fastqsanger or fastqsanger.gz depending on if they are really compressed or not. The datatype fastq.gz does not create recognizable inputs for this tool and many others that accept fastq inputs. Trim Galore!, FastQC, and other data preparation tools are usually the exceptions, so this type of problem tends to show up at the first true analysis step (usually when mapping).
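The rule above (fastqsanger for uncompressed data, fastqsanger.gz for compressed data) can be sketched as a small check to run on a file before its datatype is assigned in Galaxy. This is an illustrative helper, not Galaxy code. `suggest_datatype` is a hypothetical name, and the check only detects gzip compression; it does not verify that the quality scores are actually Sanger-scaled.

```python
def suggest_datatype(path):
    """Suggest 'fastqsanger.gz' for gzip-compressed files, else 'fastqsanger'."""
    with open(path, "rb") as handle:
        compressed = handle.read(2) == b"\x1f\x8b"  # gzip magic bytes
    return "fastqsanger.gz" if compressed else "fastqsanger"
```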

The Support FAQs here explain more about fastq data in Galaxy: https://galaxyproject.org/support/#getting-inputs-right

And the Galaxy Tutorials here include examples of manipulating fastq data: https://galaxyproject.org/learn/ (all, including general) and http://galaxyproject.github.io/training-material/topics/galaxy-data-manipulation/ (advanced collection methods).

modified 3 months ago • written 3 months ago by Jennifer Hillman Jackson ♦ 25k

Hi Jennifer,

Thanks so much for your follow-ups and replies. On the usegalaxy server, the input files did indeed have the datatype fastq.gz. However, on the local instance where this was failing, the assigned datatype was fastqsanger.gz.

When I build the collection, the .fastq extension is maintained in the label name of the file pair. So I have modified the extension of the input data file on our server from .fastq.gz to .fastqsanger.gz and am executing a run at the moment.

As for the usegalaxy server, I have started a run in which I make sure the assigned datatype is fastqsanger.gz, and I will then give you an update.

Somehow, I still believe this is happening because the underlying tophat command is being executed with the ".dat" extension and is not being told that the file is just a "compressed" version.

written 3 months ago by othman.soufan • 0

Just in case, I have also replied to the bug report email with the share link for the workflow.

written 3 months ago by othman.soufan • 0

Do you have any update on this issue?

written 3 months ago by othman.soufan • 0
