I keep getting a reference annotation error when I run Cuffdiff on HISAT2 alignment files. The Cuffdiff output file is still produced and looks normal. When I use TopHat2 alignment files (from the same dataset), I don't get the Cuffdiff reference annotation error. I've tried multiple datasets and always get the same result: a reference annotation error when aligning with HISAT2 and no error when aligning with TopHat2. Has anyone encountered this before? This is my protocol: download FASTQ files from SRA, HISAT2 or TopHat2 alignment, Cufflinks, Cuffmerge, Cuffdiff. I am new to NGS mapping and analysis.
Hi,
This does seem odd. I would suggest reviewing the Cuffdiff output to see whether the reference annotation dataset (GTF) was really used or not. Specifically, check for gene_id and transcript_id values (from the GTF) instead of "XLOC" identifiers (the default). At least some output lines should have gene_id/transcript_id incorporated. I suspect that the Cuff* output won't have them if the HISAT-based run is rejecting the dataset. The usage/error trapping was improved in the newer tool wrappers. The Cuff* tools are a bit dated and are considered deprecated, both for scientific reasons and because their complicated usage tends to lead to errors.
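If the output is too large to scan by eye, a small script can do the counting. This is only a rough sketch: it assumes the Cuffdiff output was downloaded as a tab-delimited file with a header row containing a gene_id column, and the function name `summarize_ids` and the example rows are made up for illustration.

```python
import csv
import io


def summarize_ids(handle, column="gene_id"):
    """Count XLOC-style versus other identifiers in one column of a tab-delimited table.

    Assumes the first row is a header that contains `column`.
    """
    counts = {"xloc": 0, "other": 0}
    for row in csv.DictReader(handle, delimiter="\t"):
        value = row.get(column) or ""
        if value.startswith("XLOC_"):
            counts["xloc"] += 1
        else:
            counts["other"] += 1
    return counts


# Made-up rows, only to show the idea (not real Cuffdiff output).
example = (
    "test_id\tgene_id\tgene\n"
    "XLOC_000001\tXLOC_000001\t-\n"
    "GENE1\tGENE1\tabc\n"
)
print(summarize_ids(io.StringIO(example)))
```

With a real file, pass `open("your_cuffdiff_output.tabular")` (a placeholder name) instead of the `io.StringIO` example. If every value falls under "xloc", that would suggest the reference annotation was not incorporated.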
Whenever reference annotation is used, it is very important that it is an exact match for the target genome/build being mapped against. The chromosome identifiers must match. There should be no description content on a custom genome's sequence identifier lines (">" lines), just sequence names. And the database metadata attribute needs to be assigned to the GTF for many tools to accept it as proper input (they check that value against the target genome's database name).
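The first two checks can be roughly scripted for a custom genome. The sketch below assumes you have the genome FASTA and the GTF available as plain text. The helper names (`fasta_headers`, `gtf_seqnames`, `compare_identifiers`) and the example data are hypothetical, and it does not check the database metadata attribute, which is assigned inside Galaxy.

```python
def fasta_headers(lines):
    """Return (all sequence names, names whose '>' line carries extra description text)."""
    names, described = set(), set()
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].strip().split(None, 1)
            if not parts:
                continue
            names.add(parts[0])
            if len(parts) > 1:
                described.add(parts[0])
    return names, described


def gtf_seqnames(lines):
    """Return the set of values in the first (chromosome) column of a GTF."""
    names = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comment and blank lines
        names.add(line.split("\t", 1)[0])
    return names


def compare_identifiers(fasta_lines, gtf_lines):
    """Report chromosome names found in only one file, plus FASTA headers with descriptions."""
    fasta_names, described = fasta_headers(fasta_lines)
    gtf_names = gtf_seqnames(gtf_lines)
    return {
        "gtf_only": sorted(gtf_names - fasta_names),
        "fasta_only": sorted(fasta_names - gtf_names),
        "headers_with_description": sorted(described),
    }


# Made-up data: "chr1" versus "1" is a typical mismatch between sources.
fasta_example = [">chr1 some description text\n", "ACGT\n", ">chr2\n", "ACGT\n"]
gtf_example = [
    "#comment line\n",
    '1\tsrc\texon\t1\t4\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n',
    'chr2\tsrc\texon\t1\t4\t.\t+\t.\tgene_id "g2"; transcript_id "t2";\n',
]
print(compare_identifiers(fasta_example, gtf_example))
```

Anything listed under "gtf_only" is a chromosome name that the genome does not contain, so annotation on it cannot be matched to the mapped reads. For real files, pass open file handles in place of the example lists.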
The troubleshooting and input Support FAQs here can help to resolve the majority of problems with reference annotation across tools: https://galaxyproject.org/support/#troubleshooting. If you cannot identify the problem after reviewing the help and can reproduce the problem at Galaxy Main https://usegalaxy.org, a bug report can be sent in for feedback (how-to is also included in the FAQs). Please do not delete datasets (inputs/outputs) associated with the reported error or our ability to help will be limited. I reviewed your current active histories and some of your deleted histories at Galaxy Main already but couldn't locate the history that contains this problem (maybe it occurred at a different Galaxy server?).
The Galaxy Tutorials here also cover RNA-seq analysis: https://galaxyproject.org/learn/. It is advised to switch away from the Cuff* tools and instead use the newer tools/methods, with inputs that have proper content, format, and labels (datatype, database).
Thanks! Jen, Galaxy team