Hello,
The format of the original uploaded fastq data (datasets 3 and 20) appears to be OK. However, I can see that the Trinity assembly is failing at an early step.
I would suggest loading the data directly from the source into Galaxy and manipulating it within Galaxy (trim, other QA). Use the tool Download and Extract Reads in FASTA/Q format from NCBI SRA. Then try a rerun. Sometimes data from this source needs standardized reformatting, see the help here for details: https://galaxyproject.org/support/ncbi-sra-fastq/
The datatype attribute will be automatically assigned. Avoid assigning a database attribute. If you want to annotate datasets by the source species (or any other info), consider using Tags: https://galaxyproject.org/tutorials/histories/#tagging-datasets
If the job fails again with the cleaned up inputs, it is probably too large to run at Galaxy Main https://usegalaxy.org. The tool itself has no known issues, and I see that you have had other successful runs using different inputs. Choices:
- You could try running a sample/subset of the data through as a test (or for the final result, as there are many duplicated reads, see the FastQC report for details). To sub-sample randomly, convert with Fastq-to-Tabular, run the tool Select random lines from a file, then convert back with Tabular-to-Fastq.
- Consider setting up your own Galaxy server and allocating sufficient memory. Cloudman is a good choice for many. https://galaxyproject.github.io/ and https://galaxyproject.org/choices
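If you prefer to sub-sample before uploading, the same random-selection idea from the Fastq-to-Tabular / Select random lines / Tabular-to-Fastq workflow can be sketched in Python. This is a minimal sketch, not a Galaxy tool: the file names are placeholders, and it assumes standard 4-line FASTQ records.

```python
import random

def subsample_fastq(in_path, out_path, n, seed=42):
    """Randomly keep n reads from a FASTQ file (4 lines per read)."""
    with open(in_path) as fh:
        lines = fh.readlines()
    # Group the flat line list into 4-line read records.
    reads = [lines[i:i + 4] for i in range(0, len(lines), 4)]
    random.seed(seed)  # fixed seed so the subset is reproducible
    keep = random.sample(reads, min(n, len(reads)))
    with open(out_path, "w") as out:
        for rec in keep:
            out.writelines(rec)
    return len(keep)
```

For a file the size of yours (~9.4M reads) this loads everything into memory; a streaming reservoir sample would be the safer design for very large inputs.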
Note: The database you have been assigning is not hg38, but another human database. Human hg38 is the genome you had successful mapping against, and because the fastq database assignment is different, the BAM and other results end up with the wrong database assignment (inherited from the fastq input) for tools like Tophat and Cufflinks. This is a known bug we are working to resolve: https://github.com/galaxyproject/usegalaxy-playbook/issues/104. For now, do not assign a database to fastq inputs. This is not a factor for Trinity assembly, but I would still avoid the database assignment on fastq/fasta inputs when using most tools. If you must assign it, make sure it is correct (the same database used in the rest of the analysis). How to remove/adjust metadata assignments: https://galaxyproject.org/support/metadata/
Thanks! Jen, Galaxy team
I have tried converting it to fasta format and running Trinity; it still fails with the same error.
I followed the instructions below to evaluate my fastq file:
- Run FastQC first to assess the type.
- Run FASTQ Groomer if the data needs to have the quality scores rescaled.
- If you are certain that the quality scores are already scaled to Sanger Phred+33 (the result of an Illumina 1.8+ pipeline), the datatype ".fastqsanger" can be directly assigned. Click the pencil icon to reach the Edit Attributes form. In the center panel, click on the "Datatype" tab (3rd), enter the datatype ".fastqsanger", and save. Metadata will assign, then the dataset can be used.
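The "assess the type" check that FastQC does for the encoding can also be approximated with a quick Python heuristic, for anyone wanting to double-check outside Galaxy. This is a sketch under the usual assumptions: 4-line FASTQ records, and the standard character ranges for Phred+33 vs Phred+64 quality scores.

```python
def guess_phred_offset(fastq_path, max_reads=10000):
    """Guess Sanger Phred+33 vs Illumina 1.3+ Phred+64 from quality characters."""
    lo, hi = 255, 0
    with open(fastq_path) as fh:
        for i, line in enumerate(fh):
            if i // 4 >= max_reads:
                break
            if i % 4 == 3:  # every 4th line is the quality string
                for ch in line.rstrip("\n"):
                    o = ord(ch)
                    lo, hi = min(lo, o), max(hi, o)
    if lo < 59:   # characters below ';' (ASCII 59) only occur in Phred+33
        return 33
    if lo >= 64:  # nothing below '@' (ASCII 64) suggests Phred+64
        return 64
    return None   # ambiguous; inspect more reads or trust FastQC
```

A Phred+33 (fastqsanger) file would return 33 here, matching the "Sanger / Illumina 1.9" encoding FastQC reported for your data.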
FastQC report summary:
Filename: filter2SRR299028_fastq
File type: Conventional base calls
Encoding: Sanger / Illumina 1.9
Total Sequences: 9399306
Sequences flagged as poor quality: 0
Sequence length: 100
%GC: 41
I changed the datatype to fastqsanger and reran, but it still fails. I don't know what is going on.