Question: RNA-seq analysis in Galaxy
0
gravatar for Fang,Xiefan
3.6 years ago by
Fang,Xiefan30
United States
Fang,Xiefan30 wrote:

I am doing RNA-seq analysis for several mouse samples and I encounter problems during differential expression analysis. First, I used Galaxy tools to clean,filter, and trim my reads and Tophat for alignment. I selected the built-in genome mm10 for alignment and the mapping efficient is above 85%. Then, I used cufflinks without using reference annotation and I continued to do cuffmerge with the ensembl Mus_musculus.GRCm38.79.gtf annotation, followed by Cuffdiff. The results showed zero in values, fold changes, test_stat for every gene. I downloaded the BAM file and analyzed it with HTSeq-count using the Mus_musculus.GRCm38.79.gtf, and I found that all the reads had no feature or ambiguous alignment. I was wondering if I have used the wrong mapping genome or annotation file. Specifically, I have the following questions:

1. Is the built-in mouse genome in galaxy obtained from UCSC?

2. Shall I use the UCSC gff file for Cuffmerge annotation? How can I download gff file from UCSC?

3. If I want to use ensembl annotation, should I map my reads to the ensembl genome?

4. In the RNA-seq tutorial for Galaxy, reference annotation is not used during Cufflinks but used for Cuffmerge. Is this the correct method for DE analysis in my case?

5. I split the reads into 8-10 smaller FASTQ files before alignment using Tophat. Afterwards, I merged all the resulting BAM files to a single BAM file and converted it to SAM. However, this SAM file is not able to be processed with HTSeq-count. But the smaller BAM files directly resulting Tophat work well with HTSeq-count. What is the problem? 

Thank you very much for your attentions and I look forward to the answers!!    

        

gff tool cufflinks rna-seq galaxy • 2.7k views
ADD COMMENTlink modified 3.6 years ago • written 3.6 years ago by Fang,Xiefan30

Hello Xiefan

It looks like you're using the right version of the genome build. According to Ensembl: "GRCm38, was released by the Genome Reference Consortium in January 2012... This assembly is used by UCSC to create their mm10 database". So mm10 and m38 should be compatible. 

You mention that you've followed an RNA-Seq tutorial for Galaxy. I'm not sure which one this is, but is there a reason that you are using Cufflinks and Cuffmerge to build your own GTF file? For most purposes, the Ensembl Mus_musculus.GRCm38.79.gtf should be suitable input as the transcripts file in Cuffdiff, without having to go through Cufflinks and Cuffdiff.

Even so, you should still have got something other than zeros for at least some of your genes, and since neither Cuffdiff nor HTSeq-count are giving anything, I suspect that it might be a problem with your merged BAM file. Was there a reason for splitting it up and merging all the resulting BAM files afterwards, or can you try remapping with the intact FASTQ file and see if that BAM file works? And what error were you getting from HTSeq-count when you tried the merged file - that might help identify if that is indeed the problem

Cheers

Mark

ADD REPLYlink written 3.6 years ago by Mark Crowe100
0
gravatar for Fang,Xiefan
3.6 years ago by
Fang,Xiefan30
United States
Fang,Xiefan30 wrote:

I figured out the problems regarding the annotation file. It turns out that I need to convert the ensembl GTF file by using the "Make Ensembl GTF compatible with Cufflinks" workflow. I used the converted annotation file for cufflinks and cuffmerge. I got the read counts and fold changes after cuffdiff. However, the gene_id was changed to XLOC IDs instead of Ensembl IDs (see below). How can I keep the ensembl ID for genes and transcripts? Thank you!!

For transcript differential expression testing:

test_id gene_id gene locus
TCONS_00000001 XLOC_000001 RP23-271O17.1 chr1:3073252-3074322
TCONS_00000002 XLOC_000002 Gm26206 chr1:3102015-3102125
TCONS_00000003 XLOC_000003 RP23-317L18.1 chr1:3205900-3671498
TCONS_00000004 XLOC_000004 Gm1992 chr1:3205900-3671498
TCONS_00000005 XLOC_000005 RP23-115I1.2 chr1:3205900-3671498

 

For gene differential expression testing:

test_id gene_id gene locus
XLOC_000011 XLOC_000011 RP23-37D15.2 chr1:4610470-4611406
XLOC_000057 XLOC_000057 Gm26348 chr1:10137570-10232670
XLOC_000139 XLOC_000139 Khdc1b chr1:21383555-21386373
XLOC_000152 XLOC_000152 RP23-255J24.2 chr1:24100624-24103424
XLOC_000174 XLOC_000174 Gm29128 chr1:31135225-31222684
ADD COMMENTlink written 3.6 years ago by Fang,Xiefan30
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 172 users visited in the last hour