Question: RNA-seq analysis for a small number of genes
3.2 years ago, kr81 wrote:

Hi all,

I am new to RNA-seq analysis and want to look at the expression of a small number of genes (8) in some publicly available RNA-seq datasets.

I came up with a simple method that avoids aligning the RNA-seq reads to the whole genome and running a full TopHat/Cufflinks analysis (or similar). Briefly, what I did was: download and QC-filter the SRA dataset > map the reads with bowtie2 to a multi-FASTA file containing the exonic sequences of my genes of interest > filter out hits with MAPQ < 30 > obtain FPKM values using eXpress.
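For readers who want to see what the MAPQ step amounts to, here is a minimal sketch in plain Python that works on SAM text lines. The function name `filter_sam_by_mapq` and the toy reads are made up for illustration; the post does not say which tool was actually used for this step.

```python
def filter_sam_by_mapq(lines, min_mapq=30):
    """Yield SAM header lines plus mapped alignments with MAPQ >= min_mapq.

    Hypothetical helper for illustration only; MAPQ is column 5 of a SAM record.
    """
    for line in lines:
        if line.startswith("@"):
            yield line  # keep header lines untouched
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # skip malformed records (SAM has 11 mandatory columns)
        flag = int(fields[1])
        mapq = int(fields[4])
        if flag & 0x4:
            continue  # read is flagged as unmapped
        if mapq >= min_mapq:
            yield line


# Toy SAM records (invented data): one confident hit, one ambiguous hit, one unmapped read.
toy_sam = [
    "@SQ\tSN:geneA_exons\tLN:2000",
    "\t".join(["read1", "0", "geneA_exons", "100", "42", "4M", "*", "0", "0", "ACGT", "IIII"]),
    "\t".join(["read2", "0", "geneA_exons", "300", "12", "4M", "*", "0", "0", "ACGT", "IIII"]),
    "\t".join(["read3", "4", "*", "0", "0", "*", "*", "0", "0", "ACGT", "IIII"]),
]

for record in filter_sam_by_mapq(toy_sam):
    print(record)
```

In practice a dedicated tool such as `samtools view -q 30` does the same filtering on real SAM/BAM files.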

Does anyone have any thoughts on whether my method is valid? My main concern is that mapping reads to such a low-complexity reference may artificially inflate the number of reads that map to my genes of interest and bias the FPKM values.

Thanks in advance :)

Tags: rna-seq, bowtie2, fpkm, express
3.2 years ago, Jennifer Hillman Jackson (United States) wrote:


The statistics will be different from what would be obtained by performing a genome-wide analysis, but the relative values within the experiment should still be comparable (and informative). Just keep that in mind.
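A small worked example may help show why the absolute numbers shift while the within-sample ratios hold. This is only a sketch using the standard FPKM definition, with invented counts and lengths. It assumes the denominator is the number of fragments that mapped to whatever reference was supplied, and it does not model reads from other loci mapping spuriously to the reduced reference.

```python
import math


def fpkm(fragments, length_bp, total_mapped):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (length_bp * total_mapped)


# Invented toy numbers: (fragment count, transcript length in bp).
genes = {"gene_A": (500, 2000), "gene_B": (1000, 1000)}

# Genome-wide analysis: the denominator is all mapped fragments in the library.
genome_total = 20_000_000
# Restricted reference: the denominator is only the fragments hitting the chosen genes.
restricted_total = sum(count for count, _ in genes.values())

genome_fpkm = {g: fpkm(c, l, genome_total) for g, (c, l) in genes.items()}
restricted_fpkm = {g: fpkm(c, l, restricted_total) for g, (c, l) in genes.items()}

# Absolute values differ by a constant factor within the sample...
print(genome_fpkm)
print(restricted_fpkm)

# ...but the ratio between the two genes is the same either way.
ratio_genome = genome_fpkm["gene_A"] / genome_fpkm["gene_B"]
ratio_restricted = restricted_fpkm["gene_A"] / restricted_fpkm["gene_B"]
print(math.isclose(ratio_genome, ratio_restricted))
```

Note that with the restricted reference the denominator is driven entirely by the chosen genes, so a large change in one of them between samples would shift the apparent values of the others.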

A slightly better approach would be to map to the entire genome (a "canonical" version might be a good choice, to eliminate the noise from unplaced sequences, alternate haplotypes, and other fragments), and then restrict the analysis at the stage where a reference annotation dataset is introduced. Cuffdiff will only report results for the transcripts (representing genes) that it is informed about through this annotation. You do not need to include the entire transcriptome (or exome, for that matter) if the background is not needed. You can also restrict by chromosome or region. And you do not need to do discovery, which means skipping Cufflinks and Cuffmerge and moving directly from TopHat to Cuffdiff, using a reference annotation GTF/GFF3 dataset of just those genes of interest.
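One way to build such a restricted annotation is to subset an existing GTF by `gene_id`. The sketch below is a rough illustration in plain Python; the helper name `filter_gtf`, the gene identifiers, and the toy records are all hypothetical, and it assumes the attribute column uses the usual `gene_id "..."` form.

```python
import re

# Matches the gene_id attribute in GTF column 9, e.g.: gene_id "GENE_A";
GENE_ID_RE = re.compile(r'gene_id "([^"]+)"')


def filter_gtf(lines, genes_of_interest):
    """Yield comment lines and GTF records whose gene_id is in genes_of_interest."""
    keep = set(genes_of_interest)
    for line in lines:
        if line.startswith("#"):
            yield line  # pass header/comment lines through
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # skip malformed records (GTF has 9 columns)
        match = GENE_ID_RE.search(fields[8])
        if match and match.group(1) in keep:
            yield line


# Toy annotation (invented records) with two genes; only GENE_A is wanted.
toy_gtf = [
    "#!toy annotation",
    'chr1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "GENE_A"; transcript_id "GENE_A.1";',
    'chr1\tsrc\texon\t300\t400\t.\t+\t.\tgene_id "GENE_A"; transcript_id "GENE_A.1";',
    'chr2\tsrc\texon\t500\t900\t.\t-\t.\tgene_id "GENE_B"; transcript_id "GENE_B.1";',
]

for record in filter_gtf(toy_gtf, ["GENE_A"]):
    print(record)
```

The filtered lines can be written to a new file and supplied as the reference annotation, while the reads are still mapped against the whole genome.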

Short answer: some values will be skewed when inputs are deliberately limited, but it may not matter for certain types of analysis.

Best, Jen, Galaxy team
