Question: Different outputs from Cuffdiff using different genome files
I am analyzing gene expression changes after siRNA knockdown (2 replicates each, control and gene knockdown). I ran Cuffdiff three times, each time with a different genome GTF file.

Within the top 50 fold changes in gene expression (looking only at changes that are scored as significant in the Cuffdiff output), only about 20-30% are present on all three lists. Why do three different input genome files give me quite different lists of changes?

What information can I use when these lists are quite variable?


rna-seq
The specific transcripts and genes included in the reference annotation influence how these tools cluster the data. The content can differ between annotation sources depending on the rules applied to build the transcripts, the rules used to cluster them into genes, and/or the attributes in the file itself. Key attributes: 1) at minimum the presence of tss_id and p_id, and ideally gene_name; 2) transcripts actually clustered into genes. If transcript_id and gene_id have the same value, then there is no gene clustering.
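One way to compare candidate annotation files on these points is to count how many transcripts carry each attribute. The following is a minimal sketch in Python, not part of Cufflinks or Galaxy: the function name `summarize_gtf_attributes` is hypothetical, and it assumes a standard 9-column, tab-separated GTF with attributes written as `key "value";` pairs.

```python
import re

# Matches GTF attribute pairs such as: gene_id "G1";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')

# Attributes the answer above names as useful to Cuffdiff.
WANTED = ("tss_id", "p_id", "gene_name")


def summarize_gtf_attributes(lines):
    """Summarize, per transcript, which key attributes are present.

    `lines` is any iterable of GTF lines (for example an open file).
    Returns a dict of counts; "unclustered" is the number of transcripts
    whose gene_id equals their transcript_id.
    """
    transcripts = {}  # transcript_id -> {"gene_id": str or None, "keys": set}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a well-formed GTF record
        attrs = dict(ATTR_RE.findall(fields[8]))
        tid = attrs.get("transcript_id")
        if tid is None:
            continue
        entry = transcripts.setdefault(
            tid, {"gene_id": attrs.get("gene_id"), "keys": set()}
        )
        # An attribute counts if it appears on any record of the transcript.
        entry["keys"].update(k for k in WANTED if k in attrs)

    summary = {"n_transcripts": len(transcripts), "unclustered": 0}
    for key in WANTED:
        summary[key] = 0
    for tid, entry in transcripts.items():
        if entry["gene_id"] == tid:
            summary["unclustered"] += 1
        for key in entry["keys"]:
            summary[key] += 1
    return summary
```

To use it, pass an open file, for example `summarize_gtf_attributes(open("genes.gtf"))`, where `genes.gtf` is a placeholder path. A file in which `unclustered` equals `n_transcripts`, or in which the `tss_id` and `p_id` counts are zero, lacks the clustering information described above.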

The best annotation file available for your target genome is one that contains most or all of these attributes. If there are several such files available, then review how the transcripts and genes are constructed and decide which is the best match for your experiment. Some sources have stricter rules than others and some contain predictions (which may or may not be desirable). 

iGenomes GTF datasets are an example of annotation files with all of the attributes Cuffdiff can utilize. 

Best, Jen, Galaxy team

