Question: gene Symbol from cuffdiff output
3.6 years ago by
United States
andrew.chess0 wrote:


I received output from cuffdiff which included gene_id, gene and locus for each differentially-expressed gene, but I would like to have the gene Symbol.  For example, here is the information for one differentially-expressed gene:

Cuffdiff output:

gene_id        gene      locus
XLOC_063802    M97200    chr8:84901927-84905289


So I looked up the gene using ‘M97200’ and found, via the UCSC Genome Browser, that it corresponds to a mouse mRNA in GenBank from the following gene:

Symbol    Name                                 ID
Klf1      Kruppel-like factor 1 (erythroid)    MGI:1342771


I have two questions:

1. Do you have a suggestion for how to use one (or more) of the three columns from the cuffdiff output I have to look up the gene Symbols?

2. Is there a setting I can enter when starting the cuffdiff (in the future) which will give me the gene Symbol as a column in my output?


3.6 years ago by
United States
Jennifer Hillman Jackson25k wrote:


There are a few ways to correlate gene symbols with the output. Here are some examples:

1. Use a reference annotation dataset that includes the "gene_name" attribute. The versions from iGenomes contain it, for example (Shared Data -> Data Libraries -> iGenomes on Main has a few loaded and uncompressed). These are same-species annotations.

2. After running the analysis, join the output with a tabular annotation file that contains the gene symbol plus at least one unique attribute that is also present in the Cuffdiff output (transcript_id is a good choice). Use the tool "Join two Datasets side by side on a specified field". You can build the annotation file from UCSC output. For example, if using RefSeq transcripts, the value "name2" in the primary table of the complete "RefSeq Genes" track is the gene symbol. This value is not present in a BED export, so export the whole table, or at least the transcript_id and name2 fields, and send the result to Galaxy.

Option 2 can be same-species or cross-species. You can make whatever connections fit your analysis goals: just put the data into the same file and use it in the join.

3. Another option for UCSC-based genomes: use coordinate overlap to annotate from other genome sources. The idea is to have the transcripts (with attached gene symbols) from the other target genome in a file where they are mapped to that other reference genome. Then use "Lift-Over -> Convert genome coordinates" to convert the Cuffdiff coordinates to that other genome, and finish by looking for overlap between the two using the "Operate on Genomic Intervals -> Join" tool.
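
To illustrate the overlap idea outside of Galaxy, here is a minimal Python sketch. It assumes both sets of coordinates are already on the same genome build (the lift-over step is not shown) and treats intervals as closed. The `Interval` type, the function names, and the annotation coordinates are all hypothetical and made up for illustration; only the Cuffdiff locus comes from the question above.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int
    end: int
    label: str


def parse_locus(locus: str, label: str) -> Interval:
    """Parse a Cuffdiff-style locus string such as 'chr8:100-200'."""
    chrom, span = locus.split(":")
    start, end = span.split("-")
    return Interval(chrom, int(start), int(end), label)


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two closed intervals on the same chromosome share a base."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def annotate_by_overlap(loci, annotations):
    """Map each locus label to the labels of all overlapping annotations."""
    return {
        locus.label: [ann.label for ann in annotations if overlaps(locus, ann)]
        for locus in loci
    }


# The locus is from the Cuffdiff output in the question.
loci = [parse_locus("chr8:84901927-84905289", "XLOC_063802")]

# Hypothetical annotation intervals: coordinates are invented, not real.
annotations = [
    Interval("chr8", 84901000, 84906000, "Klf1"),
    Interval("chr8", 90000000, 90005000, "GeneB"),
    Interval("chr9", 84901927, 84905289, "GeneC"),
]

hits = annotate_by_overlap(loci, annotations)
print(hits)
```

This simple version compares every locus against every annotation, which is fine for small files; for genome-scale data, the Galaxy interval tools are the more practical route.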

In short, any way that you can link together the data from one file to another - by either a common field or by overlapping coordinates based on the same reference genome - can be used to pull in annotation. And it isn't limited to Gene Symbol. There are annotation tools that will link info from external files (in specific dataset formats, or from specific sources), and those can be very useful if they fit your data, but you can always join in your own using a method like those above.
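
For the common-field case, the join can be sketched in a few lines of Python. This is only an illustration of what a side-by-side join on one column does, not how the Galaxy tool is implemented. The annotation column names (`accession`, `symbol`) and the function names are assumptions; the example row and the M97200 to Klf1 link come from the question above.

```python
import csv
import io
from collections import defaultdict

# Tab-separated stand-ins for the two files (in practice, read from disk).
CUFFDIFF_TSV = (
    "gene_id\tgene\tlocus\n"
    "XLOC_063802\tM97200\tchr8:84901927-84905289\n"
)
# Hypothetical annotation table: one identifier column, one symbol column.
ANNOTATION_TSV = (
    "accession\tsymbol\n"
    "M97200\tKlf1\n"
)


def read_tsv(text):
    """Read tab-separated text into a list of dicts keyed by header."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def join_on_field(left_rows, right_rows, left_key, right_key, value_field):
    """Append value_field from right_rows to each left row with a matching key.

    Multiple matches are comma-joined; rows with no match get 'NA'.
    """
    lookup = defaultdict(list)
    for row in right_rows:
        lookup[row[right_key]].append(row[value_field])
    joined = []
    for row in left_rows:
        values = lookup.get(row[left_key], [])
        joined.append({**row, value_field: ",".join(values) if values else "NA"})
    return joined


joined = join_on_field(
    read_tsv(CUFFDIFF_TSV), read_tsv(ANNOTATION_TSV),
    left_key="gene", right_key="accession", value_field="symbol",
)
for row in joined:
    print("\t".join(row.values()))
```

Keeping unmatched rows (with "NA") rather than dropping them makes it easier to spot identifiers that the annotation file does not cover.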

Just keep in mind that a many-to-many relationship can exist, even within datasets based on the same reference genome. For an example of that at UCSC, examine hg19's "kgXref" table. 
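
One way to check for this before joining is to count how many distinct symbols each identifier maps to, and the reverse. The sketch below uses made-up placeholder identifiers and symbols, not values from kgXref, and the function name is hypothetical.

```python
from collections import defaultdict


def find_ambiguous(pairs):
    """Return (ids with >1 symbol, symbols with >1 id) from (id, symbol) pairs."""
    by_id = defaultdict(set)
    by_symbol = defaultdict(set)
    for identifier, symbol in pairs:
        by_id[identifier].add(symbol)
        by_symbol[symbol].add(identifier)
    ids_multi = {k: sorted(v) for k, v in by_id.items() if len(v) > 1}
    symbols_multi = {k: sorted(v) for k, v in by_symbol.items() if len(v) > 1}
    return ids_multi, symbols_multi


# Placeholder mapping rows, invented for illustration only.
pairs = [
    ("TX1", "GeneA"),
    ("TX2", "GeneA"),  # two transcripts sharing one symbol
    ("TX3", "GeneB"),
    ("TX3", "GeneC"),  # one transcript listed under two symbols
]

ids_multi, symbols_multi = find_ambiguous(pairs)
print("Identifiers with several symbols:", ids_multi)
print("Symbols with several identifiers:", symbols_multi)
```

If either result is non-empty, a join on that field will produce duplicated or ambiguous rows, so decide beforehand how those cases should be handled.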

Hopefully this helps, Jen, Galaxy team



