Using different reference genome yields different results in RNA-seq data analysis

3.3 years ago by

k.ip • 10

Australia

k.ip • 10 wrote:

Hello, I have some questions about RNA-seq results that I generated in Galaxy.

In the experiment, I have a control group and a KO group (n=3), and I want to identify the genes that are differentially expressed between the two.

For processing, I simply mapped my reads to the mm10 mouse genome and then used the aligned files as input to Cuffdiff. I ran Cuffdiff in two ways, as described below:

1. The first time, I ran Cuffdiff on my RNA-seq data using one chromosome at a time (annotation generated by UCSC) as the transcript reference, which produced 22 result files.

2. The second time, I used igenome_genes as the transcript reference.

However, when I compared the two sets of results, method 1 picked up more significant genes (about 40 more) than method 2, and some of the genes found by method 1 differ from those found by method 2. I can't figure out why the two approaches give such different results.

Sorry if my description isn't clear enough.

Thanks for the help,

rna-seq galaxy • 977 views

modified 3.3 years ago • written 3.3 years ago by k.ip • 10

3.3 years ago by

Jennifer Hillman Jackson ♦ 25k

United States

Jennifer Hillman Jackson ♦ 25k wrote:

Hello,

These results are not really comparable, because the contents of the reference annotation affect the FPKM values and the differential expression tests based on those values. Depending on which transcript GTF you are using (Known Genes, RefSeq, others), the transcripts it contains probably differ from the iGenomes transcript set.

To complicate matters further, UCSC's GTF datasets had the same value populated for gene_id and transcript_id when exported from the Table Browser, last time I checked, so transcripts are not grouped into genes.

Perhaps the most important difference between the reference annotation choices is that iGenomes GTF datasets contain the extra attributes Cuffdiff uses to produce the full complement of statistics, specifically tss_id and p_id (plus gene_id). (tss_id groups transcripts that share a transcription start site and p_id groups transcripts that share a coding sequence; Cuffdiff relies on them for its promoter-use, splicing, and CDS tests. gene_id is the gene-level grouping label, which is also convenient for downstream analysis.)
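To see how the two annotations actually differ on these points, one could tally the relevant attributes in each GTF. The following is only a minimal sketch: the function names are made up for illustration, and it assumes a standard tab-delimited GTF with quoted attribute values in the ninth column.

```python
import re
from collections import Counter

# Matches attribute pairs of the form: key "value"
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')

def parse_attributes(field):
    """Parse the 9th GTF column into a dict of attribute name -> value."""
    return dict(ATTR_RE.findall(field))

def summarize_gtf(lines):
    """Count records, attribute presence, and gene_id == transcript_id cases."""
    summary = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete GTF record
        attrs = parse_attributes(cols[8])
        summary["records"] += 1
        for key in ("gene_id", "transcript_id", "tss_id", "p_id"):
            if key in attrs:
                summary[key] += 1
        gene_id = attrs.get("gene_id")
        if gene_id is not None and gene_id == attrs.get("transcript_id"):
            summary["gene_id_equals_transcript_id"] += 1
    return summary
```

Calling summarize_gtf on an open file handle for each annotation (for example, a Table Browser export and the iGenomes genes file) and comparing the counts should show whether tss_id and p_id are present and how often gene_id simply repeats transcript_id.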

Overall, a large number of factors differ, so different results are not surprising. If the goal is to isolate results by chromosome, try filtering after mapping against the full genome, using the same GTF/GFF3 annotation (whichever you judge best based on content) for all tests. If you subset this way, differences will still appear due to the transcript content, but they will be reduced and perhaps not significant. (Which annotation to use is your decision, but iGenomes is a common choice because it was designed specifically to meet the optimal input requirements of this tool suite.)
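As a rough illustration of filtering by chromosome after a single full-genome run, the sketch below groups significant genes from a Cuffdiff gene_exp.diff table by the chromosome in the locus column. The function name is hypothetical, and it assumes the usual tab-delimited layout with gene_id, gene, locus, and significant columns, where locus looks like chr1:100-200.

```python
import csv
from collections import defaultdict

def significant_genes_by_chromosome(handle):
    """Group significant genes from a Cuffdiff gene_exp.diff table by chromosome."""
    by_chrom = defaultdict(set)
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["significant"] != "yes":
            continue
        # locus is formatted as chrom:start-end
        chrom = row["locus"].split(":", 1)[0]
        # fall back to gene_id when no gene name is reported ("-")
        name = row["gene"] if row["gene"] != "-" else row["gene_id"]
        by_chrom[chrom].add(name)
    return dict(by_chrom)
```

Running this on the output of one full-genome Cuffdiff job gives per-chromosome gene sets that were all tested against the same annotation, which avoids comparing 22 separate runs with a differently built one.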

Hopefully this helps! But please see the Cuffdiff manual for more details about these factors (and other posts here; search by the keyword "Cuffdiff") to better understand how the input content is used by the tool.

Best, Jen, Galaxy team

modified 3.3 years ago • written 3.3 years ago by Jennifer Hillman Jackson ♦ 25k

3.3 years ago by

k.ip • 10

Australia

k.ip • 10 wrote:

Your explanation is very clear and helpful! Thank you so much.

written 3.3 years ago by k.ip • 10
