Question: Identifying Different Numbers Of Differentially Expressed Genes Using Ensembl Or RefSeq GTF
5.9 years ago by
ericliaowei@gmail.com wrote:
Hi all, I am analyzing significantly differentially expressed genes for a normal vs. tumor pair using Cuffdiff 2.0.2. I noticed that the number of genes called significant differs greatly depending on whether I use the Ensembl GTF or the RefSeq GTF. With the Ensembl GTF, only 250 genes are differentially expressed, but with the RefSeq GTF there are about 1000. I am running the data on a Galaxy server with the same workflow in both cases. Can anyone explain what is going on here? Which result should I trust? Thanks. -- Wei Liao Research Scientist, Brentwood Biomedical Research Institute 16111 Plummer St. Bldg 7, Rm D-122 North Hills, CA 91343 818-891-7711 ext 7645
galaxy • 2.0k views
5.9 years ago by
United States
Jennifer Hillman Jackson wrote:
Hello Wei, The contents of the reference GTF files (the originals, before analysis) will probably provide some explanation. My guess is that the two GTF files have different contents and are not directly comparable: RefSeq with full transcripts, and Ensembl with full transcripts plus potentially partial predictions and/or predicted splice sites. Alternative versions of each may be available. When possible, you will most likely want to use a reference GTF file that represents complete transcripts. I don't know what genome you are using, but you can check the source notes at Ensembl (and NCBI) to find out what each annotation build contains. A raw count of the number of entries in the GTF files can also be a clue: if the counts are greatly different, then you very likely have different populations in the two files. Good luck with your project! Jen Galaxy team -- Jennifer Hillman-Jackson Galaxy Support and Training http://galaxyproject.org
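The raw-count check suggested above could be sketched roughly as follows in Python. This is only an illustration, not something from the Galaxy team: the function name `summarize_gtf` is made up, and it assumes standard 9-column, tab-separated GTF lines whose attribute column carries `gene_id "..."` and `transcript_id "..."` entries.

```python
import re
from collections import Counter

# Patterns for the two attributes every GTF feature line is expected to carry.
GENE_ID = re.compile(r'gene_id "([^"]+)"')
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def summarize_gtf(path):
    """Count feature types and distinct gene/transcript IDs in one GTF file.

    Hypothetical helper: assumes tab-separated lines with 9 columns.
    """
    features = Counter()
    genes, transcripts = set(), set()
    with open(path) as handle:
        for line in handle:
            # Skip comment/header lines and blank lines.
            if line.startswith("#") or not line.strip():
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 9:
                continue  # not a well-formed GTF record
            features[fields[2]] += 1  # column 3 is the feature type
            gene = GENE_ID.search(fields[8])
            transcript = TRANSCRIPT_ID.search(fields[8])
            if gene:
                genes.add(gene.group(1))
            if transcript:
                transcripts.add(transcript.group(1))
    return {
        "features": dict(features),
        "genes": len(genes),
        "transcripts": len(transcripts),
    }
```

Running `summarize_gtf` on each of the two annotation files and comparing the gene and transcript totals side by side should show quickly whether the files describe very different populations.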
Hi, Another approach you can try is to use DESeq or edgeR from Bioconductor to assess differential expression. I personally like these two methods LOTS better than Cuff*, mainly because they are a lot closer to tried and true statistical methods developed for microarrays. I especially like how both methods let you test different factors. For example, if you are testing a treatment (drug or no drug) and a genotype (mutant vs. wild type), you can find out which genes' expression depends on having a wild-type copy of the gene by testing an "interaction term." Both methods start with simple counts: the numbers of reads overlapping annotated genes. There is probably a Galaxy workflow that can calculate counts of reads per gene, but I don't know if Galaxy currently incorporates R/Bioconductor tools. If you can get Galaxy to calculate reads per gene, you can then download the file and run it through edgeR or DESeq. R is free, and although it takes some time to master, it is incredibly powerful and well worth the effort! To get started with R, I recommend the free-of-charge O'Reilly Press "try R" tutorial, which is online here: http://tryr.codeschool.com/ I hope this will be helpful! Best wishes, Ann Loraine Ann Loraine, Ph.D. Associate Professor Department of Bioinformatics and Genomics University of North Carolina at Charlotte North Carolina Research Campus 600 Laureate Way Kannapolis, NC 28081 704-250-5750 aloraine@uncc.edu http://www.transvar.org http://www.bioviz.org http://www.uncc.edu
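To make the "interaction term" idea a little more concrete, here is a small Python sketch of how a two-factor design with an interaction column is usually laid out. It is only an illustration of the layout: edgeR and DESeq build their design matrices from R model formulas, and the helper name `design_row` and the 0/1 coding are assumptions made for this example.

```python
from itertools import product


def design_row(drug, mutant):
    """Return [intercept, drug, mutant, drug:mutant] for one sample.

    Hypothetical helper using 0/1 coding: drug=1 means treated,
    mutant=1 means mutant genotype.
    """
    return [1, drug, mutant, drug * mutant]


# One sample per combination: (no drug, wt), (no drug, mutant),
# (drug, wt), (drug, mutant).
samples = list(product([0, 1], repeat=2))
design = [design_row(drug, mutant) for drug, mutant in samples]

for (drug, mutant), row in zip(samples, design):
    print(f"drug={drug} mutant={mutant} -> {row}")
```

The last column is 1 only for treated mutant samples, so its coefficient measures how much the drug effect differs between genotypes; testing that coefficient is what testing the interaction term amounts to.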
written 5.9 years ago by Loraine, Ann
Hi Jennifer, Thanks for your reply! My raw RNA-seq data was mapped to hg19 without a reference GTF in our local instance. In order to troubleshoot, I tried the following:
(1) Map the data again with TopHat to hg19 using the iGenome ensembl.GTF, then use Cuffdiff to find differentially expressed genes. There are still 250 significant genes.
(2) Map the data again with TopHat to hg19 without a reference GTF, then use Cufflinks with Homo_sapiens.GRCh37.69.gtf downloaded from ensembl.org. Same result: 250 significant genes.
(3) Map the data again with TopHat to hg19 without a reference GTF, then use Cufflinks with the RefSeq refFlat.GTF. The result is ~1000 significant genes.
(4) Map the data again with TopHat to hg19 without a reference GTF, then use Cufflinks with the iGenome refseq.GTF. The result is ~1000 significant genes.
However, I still need to confirm which release or version of the hg19 reference genome I am using. Do you think the different results are caused by mapping to a different hg19 genome? If so, how can I match an hg19 build to the correct GTF? I thought the choice of Ensembl or RefSeq would not affect the results of the Cuffdiff step. These reference GTF files (refFlat.GTF, iGenome refseq.GTF, and iGenome ensembl.GTF) should all represent complete transcripts. Wei -- Wei Liao Research Scientist, Brentwood Biomedical Research Institute 16111 Plummer St. Bldg 7, Rm D-122 North Hills, CA 91343 818-891-7711 ext 7645
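One way to dig into the runs above is to check how much the two significant gene lists actually overlap. The Python sketch below is a rough illustration only: it assumes each run produced a tab-separated Cuffdiff `gene_exp.diff` table with a `gene` column and a `significant` column holding `yes`/`no`, and the function names are made up for this example. Gene naming differs between Ensembl and RefSeq annotations, so matching on the `gene` column is itself an approximation.

```python
import csv


def significant_genes(path, name_column="gene", flag_column="significant"):
    """Return the set of gene names flagged significant in one results table.

    Hypothetical helper: assumes a tab-separated file with a header row
    containing `name_column` and `flag_column`, with "yes" marking a hit.
    """
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return {row[name_column] for row in reader if row[flag_column] == "yes"}


def compare_runs(path_a, path_b):
    """Split the significant genes of two runs into shared and run-specific sets."""
    genes_a = significant_genes(path_a)
    genes_b = significant_genes(path_b)
    return {
        "shared": genes_a & genes_b,
        "only_a": genes_a - genes_b,
        "only_b": genes_b - genes_a,
    }
```

If most of the 250 genes from the Ensembl runs also appear among the ~1000 from the RefSeq runs, the difference is more likely a matter of statistical power or annotation content than of two unrelated answers.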
written 5.9 years ago by ericliaowei@gmail.com