Hello, apologies if this is a dumb question; I am relatively new to RNAseq. I have processed 9 samples through TopHat and Cufflinks, but as none of my samples have replicates I won't be using Cuffdiff to test for differential expression. My understanding is that my best option is to compare raw gene expression values through the FPKM output from Cufflinks. I want to compare gene FPKM values across two Cufflinks gene expression files, but I am unsure how to do that. Any suggestions? Thanks in advance for any help, I'd really appreciate it!
Hello,
Review the tool Cuffcompare to discover what is in common and what is not between the experiments.
From there, specific transcript sets can be extracted by using a Join on the common transcript/gene identifier. FPKM between experiments will be placed on the same line in the output, making it easier for statistics/graphs to be generated.
If you are not using reference annotation, a less precise comparison can be done by examining transcripts with overlapping genome footprints. Using tools in the group Operate on Genomic Intervals is one way to perform a merge between experiments, keeping only the regions whose coordinates overlap.
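To make the overlap idea concrete outside Galaxy, here is a minimal sketch of what "overlapping genome footprints" means. The coordinates are made up, the function names are hypothetical, and I am assuming half-open intervals (start included, end excluded); this is not how the Galaxy interval tools are implemented.

```python
def overlaps(a, b):
    """True when two (chrom, start, end) intervals share at least one base.

    Assumes half-open coordinates: start is included, end is excluded.
    """
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def overlapping_pairs(set_a, set_b):
    """Return every (a, b) pair whose footprints overlap (simple all-vs-all scan)."""
    return [(a, b) for a in set_a for b in set_b if overlaps(a, b)]


# Made-up transcript footprints from two experiments.
exp1 = [("chr1", 100, 500), ("chr1", 900, 1200), ("chr2", 50, 300)]
exp2 = [("chr1", 450, 700), ("chr2", 400, 600)]

pairs = overlapping_pairs(exp1, exp2)
for a, b in pairs:
    print(a, "overlaps", b)
```

The all-vs-all scan is fine for a small illustration; real interval tools sort or index the intervals first.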
http://cole-trapnell-lab.github.io/cufflinks/manual/
Hopefully this helps, Jen, Galaxy team
Hello, thank you for your response!
I ran cuffcompare on my 9 cufflinks assembled transcripts files, but I am not sure that I used it correctly. I am somewhat confused by the reference annotation portion of cuffcompare. This is meant for a reference transcriptome, correct? If all I want is to be able to search for a gene and see the FPKM values across my time course of samples, do I need a reference transcriptome? I did use the "sequence data" option with a reference genome.
When using the Join tool, do I use it on my cuffcompare combined transcripts output or on my cufflinks assembled transcripts output? My best guess would be to join the cuffcompare combined transcripts output against a reference genome, selecting the columns that have gene identifiers. I am not entirely sure how to get to the point where I am looking at gene FPKM values. My ideal data set is basically a column of gene IDs and then one column per sample with the FPKM values. I will keep fiddling with it, thanks for your help!
Perform the Join on the Cufflinks gene expression outputs, using the gene_id field as the common field. From there, you can use Cut to reduce the data to just the gene_id and FPKM columns.
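If you ever want to do the same Join and Cut outside Galaxy, a rough sketch in plain Python could look like the one below. The inline tables are made-up stand-ins for two gene expression files, the helper names are hypothetical, and I am assuming tab-delimited input with a header row containing `gene_id` and `FPKM` columns. It also assumes a given gene_id refers to the same gene in every file.

```python
import csv
import io


def read_fpkm(handle):
    """Map gene_id -> FPKM (kept as text) from a tab-delimited table with a header."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["gene_id"]: row["FPKM"] for row in reader}


def join_fpkm(tables):
    """Join {sample: {gene_id: FPKM}} on gene_id, keeping genes present in all samples."""
    samples = list(tables)
    shared = set.intersection(*(set(t) for t in tables.values()))
    rows = [["gene_id"] + samples]
    for gene in sorted(shared):
        rows.append([gene] + [tables[s][gene] for s in samples])
    return rows


# Made-up inputs standing in for two Cufflinks gene expression files.
sample_a = "tracking_id\tgene_id\tFPKM\nCUFF.1\tCUFF.1\t12.5\nCUFF.2\tCUFF.2\t0.8\n"
sample_b = "tracking_id\tgene_id\tFPKM\nCUFF.1\tCUFF.1\t30.1\nCUFF.3\tCUFF.3\t4.2\n"

tables = {
    "sample_a": read_fpkm(io.StringIO(sample_a)),
    "sample_b": read_fpkm(io.StringIO(sample_b)),
}
joined = join_fpkm(tables)
for row in joined:
    print("\t".join(row))
```

The result has one gene_id column followed by one FPKM column per sample. To read real files, replace the `io.StringIO(...)` handles with `open(path, newline="")`.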
The other suggestions were extra. A reference transcriptome is not required. The "sequence data" option can be used with both transcriptomes and genomes. All pre-cached reference databases on http://usegalaxy.org represent a complete "genome". (For example, "hg38" is the complete reference genome, not the set of known transcripts from that human genome build.)
When doing the analysis, be aware that there can be multiple transcripts per gene, which may confuse the data a bit with some operations.
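One way to spot this before joining is to count how many rows share each gene_id. The sketch below uses made-up transcript-level rows and a hypothetical helper name; it only reports which genes have more than one row and leaves the decision about how to handle them to you.

```python
from collections import Counter


def genes_with_multiple_rows(rows):
    """Return {gene_id: row_count} for every gene_id that appears more than once."""
    counts = Counter(row["gene_id"] for row in rows)
    return {gene: n for gene, n in counts.items() if n > 1}


# Made-up transcript-level rows: two transcripts belong to gene CUFF.1.
rows = [
    {"tracking_id": "CUFF.1.1", "gene_id": "CUFF.1", "FPKM": "3.0"},
    {"tracking_id": "CUFF.1.2", "gene_id": "CUFF.1", "FPKM": "9.5"},
    {"tracking_id": "CUFF.2.1", "gene_id": "CUFF.2", "FPKM": "0.8"},
]

dupes = genes_with_multiple_rows(rows)
print(dupes)
```

A join on gene_id pairs every matching row from one file with every matching row from the other, so genes flagged here would show up on several output lines.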
Thanks! Jen
Gotcha, that makes a lot more sense. I have begun processing my data, and I've noticed that the gene_id field is filled with values in the format CUFF.#, where # is some number. Is there some way to change this output so that I can tell what gene is being referred to, or is there a way to interpret the CUFF.# values? Thank you so much, you have already saved me an enormous amount of time!
-Will