Failed to run Trinity on a FASTQ file

-------------- Trinity Phase 1: Clustering of RNA-Seq Reads ---------------------

Wednesday, April 11, 2018: 04:38:39 CMD: cat /pylon5/mc48nsp/xcgalaxy/main/staging//19061292/inputs/dataset_24604640.dat | /opt/packages/trinity/2.2.0/trinity-plugins/fastool/fastool --illumina-trinity --to-fasta >> single.fa 2> /pylon5/mc48nsp/xcgalaxy/main/staging//19061292/inputs/dataset_24604640.dat.readcount
Trinity run failed. Must investigate error above.

Jennifer Hillman Jackson wrote (7 months ago):

Hello,

The format of the original uploaded fastq data (datasets 3 and 20) appears to be OK. However, I can see that the Trinity assembly is failing at an early step.

I would suggest loading the data directly from the source into Galaxy and manipulating it within Galaxy (trim, other QA). Use the tool Download and Extract Reads in FASTA/Q format from NCBI SRA, then try a rerun. Data from this source sometimes needs standardized reformatting; see the help here for details: https://galaxyproject.org/support/ncbi-sra-fastq/

The datatype attribute will be automatically assigned. Avoid assigning a database attribute. If you want to annotate datasets by the source species (or any other info), consider using Tags: https://galaxyproject.org/tutorials/histories/#tagging-datasets

If the job fails again with the cleaned-up inputs, the job is probably too large to run at Galaxy Main https://usegalaxy.org. The tool itself has no known issues, and I can see that you have had other successful runs using different inputs. Choices:

You could try running a sample/subset of the data as a test (or even for the final result, since there are many duplicated reads; see the FastQC report for details). To sub-sample randomly, convert with Fastq-to-Tabular, run the tool Select random lines from a file, then convert back with Tabular-to-Fastq.
Consider setting up your own Galaxy server and allocating sufficient memory. CloudMan is a good choice for many. https://galaxyproject.github.io/ and https://galaxyproject.org/choices
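
If you ever need to do the same random sub-sampling outside of Galaxy, the idea can be sketched in a few lines of Python. This is only an illustration, not how the Galaxy tools are implemented: the function names (read_fastq, reservoir_sample, subsample_fastq) are made up for this example, and it assumes plain 4-line FASTQ records (no wrapped sequence lines) and single-end data. It uses reservoir sampling so the whole file never has to be read into memory at once.

```python
import random
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

# One FASTQ record: header, sequence, separator, qualities (hypothetical alias).
Record = Tuple[str, ...]


def read_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield 4-line FASTQ records; assumes unwrapped input, skips blank lines."""
    cleaned = (ln.rstrip("\r\n") for ln in lines if ln.strip())
    while True:
        chunk = tuple(islice(cleaned, 4))
        if not chunk:
            return
        if (len(chunk) != 4
                or not chunk[0].startswith("@")
                or not chunk[2].startswith("+")):
            raise ValueError(f"malformed FASTQ record near: {chunk[0]!r}")
        yield chunk


def reservoir_sample(records: Iterable[Record], k: int, seed: int = 0) -> List[Record]:
    """Return up to k records chosen uniformly at random in one pass."""
    rng = random.Random(seed)
    sample: List[Record] = []
    for i, rec in enumerate(records):
        if i < k:
            sample.append(rec)        # fill the reservoir first
        else:
            j = rng.randint(0, i)     # inclusive on both ends
            if j < k:
                sample[j] = rec       # replace with probability k / (i + 1)
    return sample


def subsample_fastq(in_path: str, out_path: str, k: int, seed: int = 0) -> int:
    """Write a random subset of k reads from in_path to out_path."""
    with open(in_path) as fin:
        sample = reservoir_sample(read_fastq(fin), k, seed)
    with open(out_path, "w") as fout:
        for rec in sample:
            fout.write("\n".join(rec) + "\n")
    return len(sample)
```

For example, subsample_fastq("reads.fastq", "reads.sub.fastq", 100000) would keep 100,000 reads (the file names here are placeholders). For paired-end data both files would need to be sampled with the same read selection, which this sketch does not handle.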

Note: The database you have been assigning is not hg38, but another human database. Human hg38 is the genome you had successful mapping against. Because the fastq database assignment is different, the BAM and other results from tools like Tophat and Cufflinks are ending up with the wrong database assignment (inherited from the fastq input). This is a known bug we are working to resolve. For now, do not assign a database for fastq inputs. This will not be a factor for Trinity assembly, but I would still avoid the database assignment for fastq/fasta inputs when using most tools. If you must assign it, make sure it is correct (the same database used in the rest of the analysis). Related issue: https://github.com/galaxyproject/usegalaxy-playbook/issues/104 How to remove/adjust metadata assignments: https://galaxyproject.org/support/metadata/

Thanks! Jen, Galaxy team
