I am trying to run CollectRnaSeqMetrics on a tophat run, using my own gff3 annotations and fasta genome as the reference. Note that I am working with Danaus plexippus (monarch butterfly). The program "successfully" executes, yet it reports that all of my reads/bases are intergenic, which is clearly not true. When I visualize the tophat hits alongside my gff3 annotations, I see exactly what I expect, i.e. reads overlapping the gene models. This leads me to believe that the intervals are being scrambled somewhere along the way. I "tidied" my original gff3 with genometools and converted the tidy gff3 to refFlat with gff3ToGenePred (UCSC). On quick inspection, the intervals in the refFlat file are unchanged from the original gff3 (which I can visualize). Does anyone have an idea of what may be happening here? Any and all help appreciated!
The problem is due to the format of the annotation dataset (#5). It is in genePred format, but not refFlat format. See this FAQ (near end) to understand the difference.
Specifically, I noticed that dataset 5 needs one more column inserted between column 1 and column 2, so that column 3 holds the "chromosome" and the remaining fields shift into the columns that refFlat format expects.
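As a rough illustration of that column fix, here is a minimal Python sketch. It assumes the input is tab-separated genePred with the standard 10 columns (name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount, exonStarts, exonEnds). The function name `genepred_to_refflat` is hypothetical, and reusing the transcript name as the gene name is only a placeholder assumption; supply real gene names if you have them.

```python
def genepred_to_refflat(line, gene_name=None):
    """Turn one genePred line into a refFlat line by adding a gene-name column.

    Assumption: the input has the 10 standard genePred columns, tab-separated.
    If no gene name is given, the transcript name (column 1) is reused.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 10:
        raise ValueError(f"expected at least 10 genePred columns, got {len(fields)}")
    transcript_name = fields[0]
    # refFlat layout: geneName, name, chrom, strand, txStart, ...
    return "\t".join([gene_name or transcript_name] + fields)


if __name__ == "__main__":
    # Made-up example record, not taken from the dataset in question.
    example = "tx1\tchr1\t+\t100\t500\t150\t450\t2\t100,300,\t200,500,"
    print(genepred_to_refflat(example))
```

After the conversion the chromosome sits in column 3, which is where the refFlat layout expects it.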
Try correcting the format and then rerun the tool. You may also need to re-sort the input BAM dataset first (SortSam) and wrap the lines of the custom reference genome (NormalizeFasta). Many downstream tools require both for correct, successful results.
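To show what "wrapping the lines" of a reference genome means, here is a small Python sketch that rewraps FASTA sequence lines to a fixed width. It is not the NormalizeFasta implementation, only an illustration of the idea; the function name `wrap_fasta` and the width of 60 are assumptions.

```python
def wrap_fasta(lines, width=60):
    """Return FASTA lines with every sequence rewrapped to `width` characters.

    Header lines (starting with '>') are kept as-is; blank lines are dropped.
    """
    out = []
    seq_parts = []

    def flush():
        # Emit the sequence collected so far in fixed-width chunks.
        seq = "".join(seq_parts)
        for i in range(0, len(seq), width):
            out.append(seq[i:i + width])
        seq_parts.clear()

    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            flush()
            out.append(line)
        else:
            seq_parts.append(line)
    flush()
    return out


if __name__ == "__main__":
    # Made-up record for illustration only.
    for row in wrap_fasta([">scaffold1", "ACGTACGTAC", "GT"], width=4):
        print(row)
```

In practice, running NormalizeFasta itself is the safer route, since it produces the layout that the other tools are tested against.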
Hope this helps! Jen, Galaxy team