We have developed a workflow on a local Galaxy server instance that includes the following steps:
- Input dataset collection
- Trim Galore (trimming)
- Tophat (alignment) ...
The workflow starts running fine. When step 2 finishes, the output files generated by Trim Galore are stored as .dat files on the Galaxy instance, since Galaxy uses the .dat extension for every dataset it stores on disk. In this case the ".dat" file is compressed, and when Tophat starts running it fails because it cannot recognize that the file is really just a ".gz" file, which it can handle.
Tophat tool execution generates the following messages:
[2018-07-31 15:14:47] Beginning TopHat run (v2.1.0)
-----------------------------------------------
[2018-07-31 15:14:47] Checking for Bowtie
Bowtie version: 126.96.36.199
[2018-07-31 15:14:47] Checking for Bowtie index files (genome)..
[2018-07-31 15:14:47] Checking for reference FASTA file
[2018-07-31 15:14:47] Generating SAM header for /home/bioinfotools/galaxy/tool-data/cJaponicaGenome/bowtie2_index/cJaponicaGenome/cJaponicaGenome
**Error: cannot determine record type in input file /home/bioinfotools/galaxy/database/files/001/dataset_1100.dat**
Internally, if the .dat files are de-compressed or the extension is just changed to .gz, Tophat works fine. However, what is the best practice to fix this issue?
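To confirm that a given .dat file really is compressed, independent of its extension, the first two bytes can be checked against the gzip magic number. The following is only a minimal sketch for diagnosing the situation described above, not a fix for the workflow; the helper names `is_gzip` and `open_text` are made up for illustration.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def is_gzip(path):
    """Return True if the file starts with the gzip magic number."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


def open_text(path):
    """Open a plain or gzip-compressed file as text, ignoring its extension."""
    if is_gzip(path):
        return gzip.open(path, "rt")
    return open(path, "r")
```

For example, `is_gzip("/home/bioinfotools/galaxy/database/files/001/dataset_1100.dat")` would be expected to return True if the Trim Galore output was written compressed.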
I came across this post, which requires certain changes to the source code. While this is a possible workaround, it may interfere with future updates of the local instance.
Please let me know if there is any other way to overcome this issue.
Please see my responses below:
Which Galaxy release are you running? 18.05 is the most current and strongly recommended.
Using "ourip:8080/api/version", I get: version_major: "18.05" So, this is actually based on checking out the latest version of the galaxy server.
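For anyone wanting to script the same check, here is a small sketch that reads the `version_major` field from the `/api/version` endpoint mentioned above. The function names and the base URL in the usage note are placeholders, and the sketch assumes the server answers with JSON containing `version_major`, as in the output quoted above.

```python
import json
from urllib.request import urlopen


def parse_version(payload):
    """Extract version_major from the JSON text returned by /api/version."""
    return json.loads(payload)["version_major"]


def galaxy_version(base_url, timeout=10):
    """Query <base_url>/api/version and return the version_major string."""
    url = base_url.rstrip("/") + "/api/version"
    with urlopen(url, timeout=timeout) as response:
        return parse_version(response.read().decode("utf-8"))
```

Usage would look like `galaxy_version("http://localhost:8080")`, with the host and port replaced by those of the local instance.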
What was the source? Local via Github or a Docker image? The URL would be helpful.
Github using: git fetch origin && git checkout release_18.05 && git pull --ff-only origin release_18.05
Are the wrapped tools included in the workflow all from the Galaxy Main ToolShed (MTS) https://usegalaxy.org/toolshed?
Are these wrappers the most current version in the MTS? (will work best)
I believe so
Did you check under Admin > Tool Management (in the GUI) to make sure that all dependencies are installed?
Yes, all dependencies are installed, and outside of a workflow all components run fine.
Do the tools run directly from the history correctly? (outside of a workflow)
Yes (see the previous answer)
Does this workflow execute successfully at one of the Public UseGalaxy* servers? Try a sample of the data that is still representative of the whole. If you test at Galaxy Main https://usegalaxy.org and run into a problem, I'll be able to help you to review the inputs or any other usage issues that may come up in more detail. How to interpret/report a problem: https://galaxyproject.org/support/tool-error/
Yes, the same problem persists. If the files are decompressed during upload, the pipeline runs fine. However, the problem exists with compressed .gz files, as Tophat seems unable to figure out the type of file. An example of generated commands in the workflow:
The following error is generated on the server:
Fatal error: Tool execution failed
Building a SMALL index
[2018-08-01 21:29:13] Beginning TopHat run (v2.1.1)
[2018-08-01 21:29:13] Checking for Bowtie
Bowtie version: 188.8.131.52
[2018-08-01 21:29:13] Checking for Bowtie index files (genome)..
[2018-08-01 21:29:13] Checking for reference FASTA file
[2018-08-01 21:29:13] Generating SAM header for genome
Error: cannot determine record type in input file /galaxy-repl/main/files/026/459/dataset_26459197.dat
The following command was executed during the workflow run in the local instance:
As can be seen, the .dat extension is used which is confusing tophat2. If I only change the extension to .gz, the command just runs fine.
Please let me know your advice for resolving the issue.
Thanks for sending in the bug report. Could you also send in a share link to the workflow you used to generate the data? The problem may be in there if the tools work directly from the history.
Please generate the workflow share link and send it as a reply to your bug report or send it directly to firstname.lastname@example.org using the same email address as you use for your galaxy account/sending a bug report.
No extra info needed - I found the problem. :)
The input fastq datasets need to be assigned the datatype fastqsanger or fastqsanger.gz, depending on whether they are really compressed or not. The datatype fastq.gz does not create recognizable inputs for this tool and many others that accept fastq inputs. Trim Galore!, FastQC, and other data preparation tools are usually the exceptions, so this type of problem tends to show up at the first true analysis step (usually when mapping).
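As a rough aid for picking between the two datatypes before uploading, the sketch below inspects a file and suggests one of them. It is a heuristic written for illustration, not Galaxy's own datatype detection: `suggest_datatype` is a made-up name, it assumes four-line FASTQ records, and it only answers when it sees quality characters below ';' (ASCII 59), which cannot occur in the older +64 encodings and therefore point to Sanger (Phred+33) scaling. Otherwise it returns None and the datatype has to be decided by hand.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def suggest_datatype(path, max_records=1000):
    """Suggest 'fastqsanger' or 'fastqsanger.gz', or None if undecided."""
    with open(path, "rb") as raw:
        compressed = raw.read(2) == GZIP_MAGIC
    opener = gzip.open if compressed else open

    lowest = None  # smallest quality character seen so far
    with opener(path, "rt") as handle:
        for index, line in enumerate(handle):
            if index >= 4 * max_records:
                break
            if index % 4 == 3:  # every fourth line is a quality string
                quality = line.strip()
                if quality:
                    low = min(quality)
                    lowest = low if lowest is None else min(lowest, low)

    # A character below ';' is only possible with Phred+33 (Sanger) scores.
    if lowest is not None and lowest < ";":
        return "fastqsanger.gz" if compressed else "fastqsanger"
    return None
```

A None result does not mean the data is wrong, only that the sampled qualities were too high to tell the encodings apart.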
The Support FAQs here explain more about fastq data in Galaxy: https://galaxyproject.org/support/#getting-inputs-right
And the Galaxy Tutorials here include examples of manipulating fastq data: https://galaxyproject.org/learn/ (all, including general) and http://galaxyproject.github.io/training-material/topics/galaxy-data-manipulation/ (advanced collection methods).
Thanks so much for your follow-ups and replies. As for the usegalaxy server, the input files indeed had fastq.gz. However, on the local instance where this was failing, the assigned datatype was fastqsanger.gz.
When I build the collection, the .fastq extension is maintained in the label name of the file pair. So, I have just modified the extension of the input data file on our server from .fastq.gz to .fastqsanger.gz and am executing a run at the moment.
As for the usegalaxy server, I have started a run in which I made sure the assigned datatype is fastqsanger.gz, and I will give you an update.
Somehow, I still believe that it is happening because of the underlying tophat command being executed with ".dat" extension and not being told that this is just a "compressed" version.
Just in case, I have also replied back to the bug report email with the shared link of the designed workflow.
Do you have any update on this issue, please?