I am doing RNA-seq analysis for several mouse samples and I encounter problems during differential expression analysis. First, I used Galaxy tools to clean,filter, and trim my reads and Tophat for alignment. I selected the built-in genome mm10 for alignment and the mapping efficient is above 85%. Then, I used cufflinks without using reference annotation and I continued to do cuffmerge with the ensembl Mus_musculus.GRCm38.79.gtf annotation, followed by Cuffdiff. The results showed zero in values, fold changes, test_stat for every gene. I downloaded the BAM file and analyzed it with HTSeq-count using the Mus_musculus.GRCm38.79.gtf, and I found that all the reads had no feature or ambiguous alignment. I was wondering if I have used the wrong mapping genome or annotation file. Specifically, I have the following questions:
1. Is the built-in mouse genome in galaxy obtained from UCSC?
2. Shall I use the UCSC gff file for Cuffmerge annotation? How can I download gff file from UCSC?
3. If I want to use ensembl annotation, should I map my reads to the ensembl genome?
4. In the RNA-seq tutorial for Galaxy, reference annotation is not used during Cufflinks but used for Cuffmerge. Is this the correct method for DE analysis in my case?
5. I split the reads into 8-10 smaller FASTQ files before alignment using Tophat. Afterwards, I merged all the resulting BAM files to a single BAM file and converted it to SAM. However, this SAM file is not able to be processed with HTSeq-count. But the smaller BAM files directly resulting Tophat work well with HTSeq-count. What is the problem?
Thank you very much for your attentions and I look forward to the answers!!
Hello Xiefan
It looks like you're using the right version of the genome build. According to Ensembl: "GRCm38, was released by the Genome Reference Consortium in January 2012... This assembly is used by UCSC to create their mm10 database". So mm10 and m38 should be compatible.
You mention that you've followed an RNA-Seq tutorial for Galaxy. I'm not sure which one this is, but is there a reason that you are using Cufflinks and Cuffmerge to build your own GTF file? For most purposes, the Ensembl Mus_musculus.GRCm38.79.gtf should be suitable input as the transcripts file in Cuffdiff, without having to go through Cufflinks and Cuffdiff.
Even so, you should still have got something other than zeros for at least some of your genes, and since neither Cuffdiff nor HTSeq-count are giving anything, I suspect that it might be a problem with your merged BAM file. Was there a reason for splitting it up and merging all the resulting BAM files afterwards, or can you try remapping with the intact FASTQ file and see if that BAM file works? And what error were you getting from HTSeq-count when you tried the merged file - that might help identify if that is indeed the problem
Cheers
Mark