Greetings, I am trying to obtain basic alignment metrics for human 50bp-paired end RNA-seq files (fastq) I processed and aligned to hg19 with Tophat2 galaxy, and have so far failed using Picard 'CollectRnaSeqMetrics' in the main/public instance of Galaxy.
The output of this tool indicates all my reads are intergenic (literally no exonic mapping), which is not correct according to the bigwigs/Bams I can visualize in UCSC GB or IGV.
I get the same result (all intergenic reads) irrespective of whether I use straight Tophat2 output (bam), or Tophat2 output2 sorted with Picard Tools in Galaxy (sorted by coordinate bam).
Ditto whether I select the first '11 fields' as output for the RefFlat file imported to Galaxy from UCSC GB (hg19 RefSeq, per instructions) or try importing a gtf file of hg19 from UCSC GB.
Can anyone tell me where I am going wrong? This must be simple and obvious, but I don't see it. Help much appreciated.
Suspicions/questions you may have about parameters run for 'CollectRnaSeqMetrics' in Picard Tools:
1) reference genome: it's hg19 canonical, same as Tophat2 alignment that generated BAM in question
2) Gene annotations in refFlat form: I selected the first 11 fields from table browser in hg19 (tool instructions, but also tried 'all fields' and '.gtf' as refFlat output to Galaxy)
3) RNA-seq library strand specificity: 'none' (paired end)
4) assume sorted: No (but whether BAM is Picard coordinate -sorted or -unsorted doesn't change wholly intergenic mapping)
5) the level(s) at which to accumulate metrics: all reads
6) remaining parameters = default
PF_ALIGNED_BASES = 3629418122 INTERGENIC_BASES = 3629418122 CODING_BASES =0 INTRONIC_BASES =0
Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=/galaxy-repl/main/scratch -Xmx7680m -Xms256m
p.s. After filtering by quality (>20), I aligned my fastq files to hg19 canonical, and the resulting BAM file is 7GB (> 30k reads for each F and R fastq file).
help sincerely appreciated,
jake
When in doubt, look at the BAM file in IGV and see if the results of picard seem reasonable compared to what you visually observe.
Mr. Ryan, thank you for your response (A+ Biostar). I checked both in IGV and UCSC GB as you suggested, and having 0 exonic reads doesn't seem to be the problem. This has to be something with the reference file format, right? Frustrating.
Ah, it looks like you tried to make the "refFlat" file from the UCSC table browser. While one would expect that to work, it won't (I agree that this is highly annoying). Here is a link for the file you'll need. Upload that to Galaxy and I expect Picard will be happy.
Thanks again. I broke-down and downloaded the new Picard Tools and am trying this command-line (https://broadinstitute.github.io/picard/index.html#QuickStart). I have 35 BAMs and am unclear yet whether I have to sort all these w/Picard so I'll still be running it 5 years from now.
The refFlat file I downloaded from UCSC GB did not work (error: " Wrong number of fields in refFlat file hg19gtf.txt at line 1"). However, the refFlat file you provided as a link did work. Thus= you remain King of the Galaxy.
They probably need to be sorted, but whether you do it with picard or something else doesn't matter. So if you have them from Galaxy already (in which case they're already sorted), then don't bother redoing that.