gene Symbol from cuffdiff output

Hi.

I received output from cuffdiff which included gene_id, gene and locus for each differentially-expressed gene, but I would like to have the gene Symbol. For example, here is the information for one differentially-expressed gene:

Cuffdiff output:

gene_id      gene    locus
XLOC_063802  M97200  chr8:84901927-84905289

So I looked up the gene using ‘M97200’ and found, by looking at the UCSC Genome Browser, that it corresponds to a mouse mRNA from GenBank for the following gene:

Symbol  Name                               ID
Klf1    Kruppel-like factor 1 (erythroid)  MGI:1342771

I have two questions:

1. Do you have a suggestion for how to use one (or more) of the three columns from the cuffdiff output I have to look up the gene Symbols?

2. Is there a setting I can enter when running cuffdiff (in the future) which will give me the gene Symbol as a column in my output?

Thanks!

rna-seq cufflinks • 1.4k views


modified 3.6 years ago by Jennifer Hillman Jackson ♦ 25k • written 3.6 years ago by andrew.chess • 0

Jennifer Hillman Jackson ♦ 25k wrote:

Hello,

There are a few ways to correlate gene symbols with the output. Here are some examples:

1. Use a reference annotation dataset that includes the "gene_name" attribute. The versions from iGenomes contain it, for example (Shared Data -> Data Libraries -> iGenomes on Main has a few loaded and uncompressed). These are same-species annotations.

2. After running the analysis, join the file with a tabular annotation file that contains at least one unique attribute also present in the Cuffdiff output (transcript_id is good) along with the gene symbol. Use the tool "Join two Datasets side by side on a specified field". To obtain the annotation file, you can build it up from UCSC output. For example, if using RefSeq transcripts, the value "name2" in the complete "RefSeq Genes" track's primary table is a gene symbol. This will not be present in a BED export; instead, export the whole table, or at least the transcript_id and name2 fields, and send it to Galaxy.

Option 2 can be same-species or cross-species. You can make whatever connections fit your analysis goals: just put the data into the same file and use it in the join.

3. Another option for UCSC-based genomes: use coordinate overlap to annotate from other genome sources. The idea is to have the transcripts (with attached gene symbols) from the other target genome in a file where they are mapped to that other reference genome. Then use "Lift-Over -> Convert genome coordinates" to convert the Cuffdiff coordinates to that other genome, and finish by looking for overlap between the two using the "Operate on Genomic Intervals -> Join" tool.
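For anyone who wants to do the join from option 2 outside Galaxy, here is a minimal sketch in Python. It assumes both files are tab-separated and that the annotation table has two columns, an identifier and a symbol. The column names `name` and `name2`, the function `add_symbols`, and the one-row annotation table are illustrative assumptions, not the layout of any particular export; the Cuffdiff row and the M97200/Klf1 pairing are taken from the question above.

```python
import csv
import io


def add_symbols(cuffdiff_rows, annotation_rows,
                key="gene", ann_key="name", ann_symbol="name2"):
    """Left-join: return copies of the Cuffdiff rows with a 'symbol' column.

    The column names are assumptions; adjust them to match your files.
    """
    # Build identifier -> symbol; the first symbol seen for an identifier wins.
    lookup = {}
    for row in annotation_rows:
        lookup.setdefault(row[ann_key], row[ann_symbol])

    joined = []
    for row in cuffdiff_rows:
        out = dict(row)
        # Rows without a match are kept and marked "NA".
        out["symbol"] = lookup.get(row[key], "NA")
        joined.append(out)
    return joined


# In-memory stand-ins for the two tab-separated files.
cuffdiff_text = "gene_id\tgene\tlocus\nXLOC_063802\tM97200\tchr8:84901927-84905289\n"
annotation_text = "name\tname2\nM97200\tKlf1\n"

cuffdiff_rows = list(csv.DictReader(io.StringIO(cuffdiff_text), delimiter="\t"))
annotation_rows = list(csv.DictReader(io.StringIO(annotation_text), delimiter="\t"))

result = add_symbols(cuffdiff_rows, annotation_rows)
for row in result:
    print(row["gene_id"], row["gene"], row["symbol"])
```

Keeping unmatched rows (marked "NA") makes it easy to see which identifiers still need another annotation source.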

In short, any way that you can link together the data from one file to another - by either a common field or by overlapping coordinates based on the same reference genome - can be used to pull in annotation. And it isn't limited to Gene Symbol. There are annotation tools that will link info from external files (in specific dataset formats, or from specific sources), and those can be very useful if they fit your data, but you can always join in your own using a method like those above.
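The coordinate-overlap route can be sketched the same way. This example assumes the Cuffdiff loci and the annotation intervals are already on the same reference genome and treats both as closed intervals; real files may mix 0-based and 1-based coordinates, so check that first. The function names and the annotation intervals ("GeneA" and so on) are made up for illustration; only the locus string comes from the question.

```python
import re


def parse_locus(locus):
    """Split a Cuffdiff locus such as 'chr8:84901927-84905289'."""
    match = re.fullmatch(r"(\S+):(\d+)-(\d+)", locus)
    if match is None:
        raise ValueError(f"unrecognized locus: {locus!r}")
    chrom, start, end = match.groups()
    return chrom, int(start), int(end)


def overlapping_symbols(locus, annotations):
    """Return the sorted symbols whose interval overlaps the locus.

    annotations: iterable of (chrom, start, end, symbol) tuples,
    treated as closed intervals on the same genome build.
    """
    chrom, start, end = parse_locus(locus)
    return sorted({
        symbol
        for ann_chrom, ann_start, ann_end, symbol in annotations
        if ann_chrom == chrom and ann_start <= end and start <= ann_end
    })


# Made-up annotation intervals, for illustration only.
annotations = [
    ("chr8", 84900000, 84902000, "GeneA"),  # overlaps the start of the locus
    ("chr8", 84906000, 84907000, "GeneB"),  # begins after the locus ends
    ("chr9", 84901927, 84905289, "GeneC"),  # same coordinates, other chromosome
]

hits = overlapping_symbols("chr8:84901927-84905289", annotations)
print(hits)
```

A linear scan like this is fine for a handful of genes; for whole-genome tables the Galaxy interval tools or an interval index would be the better choice.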

Just keep in mind that a many-to-many relationship can exist, even within datasets based on the same reference genome. For an example of that at UCSC, examine hg19's "kgXref" table.
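Before joining, it can help to check whether the annotation table has this problem. The sketch below lists identifiers that map to more than one symbol; the function name and the identifier/symbol pairs are invented for illustration and are not taken from kgXref.

```python
from collections import defaultdict


def ambiguous_keys(pairs):
    """Return {identifier: sorted symbols} for identifiers with more than one symbol."""
    symbols_by_id = defaultdict(set)
    for identifier, symbol in pairs:
        symbols_by_id[identifier].add(symbol)
    return {
        identifier: sorted(symbols)
        for identifier, symbols in symbols_by_id.items()
        if len(symbols) > 1
    }


# Made-up (identifier, symbol) pairs, for illustration only.
pairs = [
    ("tx1", "GeneA"),
    ("tx1", "GeneB"),  # tx1 maps to two symbols
    ("tx2", "GeneA"),  # GeneA is also shared by two identifiers
    ("tx3", "GeneC"),
]

ambiguous = ambiguous_keys(pairs)
print(ambiguous)  # {'tx1': ['GeneA', 'GeneB']}
```

Swapping the two elements of each pair gives the reverse check, that is, symbols shared by several identifiers.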

Hopefully this helps, Jen, Galaxy team
