Gene_short_name missing in CuffDiff output files


4.3 years ago by

Netherlands

Dear All,

I ran the Tuxedo pipeline on the public online Galaxy server, using TopHat2, Cufflinks, Cuffmerge, and Cuffdiff. I expected my Cuffdiff output to contain gene names, so that I could directly identify genes in downstream analyses. However, they seem to be missing: instead, I only have a list of transcript IDs (all isoforms) for each gene.

At all steps (Cufflinks, Cuffdiff) I used a reference genome downloaded from UCSC, with Ensembl annotations. Now when I open, for example, the gene FPKM tracking file, the tracking_id and gene_id columns are identical and contain XLOC IDs. The gene_short_name column contains a list of Ensembl transcript IDs (although it is a gene-level file, it simply lists all transcript IDs belonging to that gene).

So to me it looks like the columns are not filled appropriately. I wondered if somebody knows what I might have done wrong or has encountered a similar problem.

Below is a fragment of a gene FPKM tracking file. The gene_short_name column also contains ENST IDs in the other files that I checked.

tracking_id	class_code	nearest_ref_id	gene_id	gene_short_name
XLOC_000001	-	-	XLOC_000001	ENST00000450305,ENST00000456328,ENST00000515242,ENST00000518655
XLOC_000002	-	-	XLOC_000002	ENST00000469289,ENST00000473358,ENST00000607096
XLOC_000003	-	-	XLOC_000003	ENST00000594647,ENST00000606857
XLOC_000004	-	-	XLOC_000004	ENST00000492842
XLOC_000005	-	-	XLOC_000005	ENST00000335137
XLOC_000006	-	-	XLOC_000006	ENST00000442987
XLOC_000007	-	-	XLOC_000007	ENST00000496488
XLOC_000008	-	-	XLOC_000008	ENST00000419160,ENST00000423728,ENST00000425496,ENST00000431321,ENST00000431812,ENST00000432964,ENST00000440038,ENST00000440163,ENST00000445840,ENST00000453935,ENST00000455207,ENST00000455464,ENST00000514436,ENST00000599771,ENST00000601486,ENST00000601814,ENST00000608420

Any ideas on what might've gone wrong are very much welcome!

Monika

rna-seq gene_name cufflinks cuffdiff • 3.0k views


modified 4.3 years ago by Jennifer Hillman Jackson ♦ 25k • written 4.3 years ago by m.maleszewska • 20

4.3 years ago by

Jennifer Hillman Jackson ♦ 25k

United States

Jennifer Hillman Jackson ♦ 25k wrote:

Hello,

I can confirm that using the iGenomes version of the reference annotation will provide the attributes required by Cuffdiff and can help avoid identifier mismatch issues with reference genomes (when working on the public Main Galaxy instance at http://usegalaxy.org, and other Galaxies using UCSC reference genomes). For this one, be sure to use the version associated with hg19 - if that is your reference genome (it seems to be). Using the file with Tophat/2 and/or Cufflinks is optional, but do use it with CuffMerge before proceeding to Cuffdiff.
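As a rough illustration of what "the attributes required by Cuffdiff" means in practice, the Python sketch below counts how often each attribute key appears in the ninth column of a GTF file. It assumes that gene_name is the attribute that ends up in gene_short_name; the function name count_attributes and the two example records are made up for illustration and do not come from any real annotation.

```python
import re
from collections import Counter

# Matches GTF attribute pairs such as: gene_id "G1";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def count_attributes(gtf_lines):
    """Count how many GTF records carry each attribute key."""
    counts = Counter()
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        # Use a set so a key repeated within one record is counted once.
        keys = {key for key, _ in ATTR_RE.findall(fields[8])}
        counts.update(keys)
    return counts


# Two hypothetical records: one with gene_name, one without it.
example = [
    'chr1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; gene_name "ABC1";',
    'chr1\tsrc\texon\t300\t400\t.\t+\t.\tgene_id "T2"; transcript_id "T2";',
]
counts = count_attributes(example)
print(counts["gene_name"], "of", counts["transcript_id"], "records have gene_name")
```

To check a real annotation, pass an open file handle instead of the example list. If gene_name is absent from most records, that would be consistent with the missing gene names described in the question.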

The hg19 iGenomes file is available on the public Main Galaxy instance (http://usegalaxy.org) under: Shared Data -> Data Libraries -> iGenomes
(same as here: http://cufflinks.cbcb.umd.edu/igenomes.html)

For example protocols that make use of it, please see our wiki:
http://wiki.galaxyproject.org/Support#Tools_on_the_Main_server:_RNA-seq

Good luck with your project! Jen, Galaxy team

written 4.3 years ago by Jennifer Hillman Jackson ♦ 25k

Thank you Jen! This is very helpful!

However, the file I found is only ~100 MB; I hope that's correct. I will try using it with Cuffmerge and then run Cuffdiff again.

Within Cuffdiff there is also an option to use a reference sequence, which I normally set to Locally cached (it then automatically assigns hg19). Should I supply a file there instead, and if so, which one? I have nothing in my History that even fits the format...

Best, Monika

modified 4.3 years ago • written 4.3 years ago by m.maleszewska • 20

Glad that helped. Yes, for human on the public Main instance, you can use the local hg19 genome. There are actually a few variants of this dataset to pick from: Full (everything, including haplotype fragments and unmapped), Canonical (autosomes, X, Y, M), and Canonical Female (autosomes, X, M = avoids X/Y PAR alignment issues). Your choice - any will function correctly with this tool set, and downstream analysis tools (some tools will reassign the genome to just be "hg19", or you can do this directly, all are compatible). Best, Jen, Galaxy team

written 4.3 years ago by Jennifer Hillman Jackson ♦ 25k

4.3 years ago by

ahdee • 30

United States

ahdee • 30 wrote:

Hi, I had the same issue but worked around it by using Excel's VLOOKUP function to cross-reference names and IDs. For example:

=VLOOKUP(A31,gene_name5!A30:B61672,2,FALSE)

Here A31 is the ID, and the next argument points to a range in a separate worksheet holding the gene names, where the first column is the ID and the second column is the gene name.
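For anyone who prefers a script to a spreadsheet, the same exact-match lookup can be sketched in Python. This is only an illustration: the function names load_lookup and lookup_names and the placeholder IDs are made up, and the two-column, tab-separated table is assumed to have been exported separately (for example from BioMart).

```python
import csv
import io


def load_lookup(table_text):
    """Build {id: gene_name} from two-column, tab-separated text."""
    reader = csv.reader(io.StringIO(table_text), delimiter="\t")
    return {row[0]: row[1] for row in reader if len(row) >= 2}


def lookup_names(ids, lookup, missing="NA"):
    """Exact-match lookup, like VLOOKUP(..., 2, FALSE); unknown IDs get `missing`."""
    return [lookup.get(identifier, missing) for identifier in ids]


# Placeholder table: first column is the ID, second is the gene name.
table = "ID_A\tGENE_A\nID_B\tGENE_B\n"
lookup = load_lookup(table)
print(lookup_names(["ID_B", "ID_X"], lookup))  # → ['GENE_B', 'NA']
```

Unlike a formula with a fixed cell range, the dictionary covers the whole table regardless of how many rows it has.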

Ahdee

written 4.3 years ago by ahdee • 30

4.3 years ago by

m.maleszewska • 20

Netherlands

m.maleszewska • 20 wrote:

Hi Ahdee,

Thank you for your tip!

I am worried, though, that if this column is messed up, something else may be wrong with my files that I don't see... Also, cummeRbund uses this column: if I make a heatmap, for example, it puts all the ENSTs along the side and it looks terrible (I can fix this by omitting the names, but that's not really a solution), and of course it takes additional time to sort things out...

What I do now is, for a chosen CuffGeneSet, use featureNames(CuffGeneSet) to retrieve a list of XLOC IDs along with my ENSTs, and then feed the ENSTs to BioMart to get gene names... But the truth is that featureNames should automatically give me gene names, not ENSTs, if only they were in the right column!

Add to this all the downstream cummeRbund analyses that use that column, and I am just sorry to have to work around this when the files should already contain the gene names... And as I said, it makes me worry whether the rest of the contents are all right...
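The manual step described above (collecting the ENST IDs for each XLOC ID so they can be submitted to BioMart) could also be done directly on the tracking file. The Python sketch below is a minimal illustration under assumptions: the function name transcripts_by_locus is made up, and only the five columns shown in the fragment in the question are used, with two rows copied from that fragment.

```python
import csv
import io


def transcripts_by_locus(tracking_text):
    """Map each tracking_id to the list of IDs found in gene_short_name."""
    reader = csv.DictReader(io.StringIO(tracking_text), delimiter="\t")
    result = {}
    for row in reader:
        names = row["gene_short_name"]
        # Treat "-" or an empty field as "no names available".
        ids = [] if names in ("", "-") else names.split(",")
        result[row["tracking_id"]] = ids
    return result


# Header and two rows taken from the fragment posted in the question.
sample = (
    "tracking_id\tclass_code\tnearest_ref_id\tgene_id\tgene_short_name\n"
    "XLOC_000003\t-\t-\tXLOC_000003\tENST00000594647,ENST00000606857\n"
    "XLOC_000004\t-\t-\tXLOC_000004\tENST00000492842\n"
)
by_locus = transcripts_by_locus(sample)

# Unique transcript IDs, ready to paste into an ID-conversion query.
unique_ids = sorted({t for ids in by_locus.values() for t in ids})
print(unique_ids)
```

This only gathers the IDs; converting them to gene names still requires an external table or service, and it does not address why the column was filled with transcript IDs in the first place.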

I would appreciate any further feedback!

NB: I actually ran my analysis a few times on Galaxy, because I changed some parameters as I learnt more along the way, but every time this column looked weird (or rather, at first all I knew was that my heatmaps looked weird in cummeRbund, and then I found out why :))... Any ideas on what could be the reason are very welcome!

Monika

written 4.3 years ago by m.maleszewska • 20

4.3 years ago by

m.maleszewska • 20

Netherlands

m.maleszewska • 20 wrote:

Similar discussion and perhaps a solution (I started it again here, because the fora did not seem connected, unless I'm wrong?): https://www.biostars.org/p/107585/#108172 .

modified 4.3 years ago • written 4.3 years ago by m.maleszewska • 20
