Question: Gene_short_name missing in CuffDiff output files
3.7 years ago by
Netherlands
m.maleszewska wrote:

Dear All,

I ran the Tuxedo pipeline on Galaxy (public, online), using TopHat2-Cufflinks-Cuffmerge-Cuffdiff. I expected my Cuffdiff output to contain gene names, so that I could directly identify genes in downstream analyses. However, they seem to be missing, and instead I only have a list of transcript ids (all isoforms) for each gene.

I used a reference genome at all steps (Cufflinks, Cuffdiff), downloaded from UCSC with Ensembl annotations. Now when I open, for example, a gene FPKM tracking file, the columns tracking_id and gene_id are identical and contain XLOC ids. The gene_short_name column contains a list of Ensembl transcript ids (although it's a gene file, it just lists all transcript ids belonging to that gene).

So to me it looks like the columns are not filled appropriately. I wondered if somebody knows what I might have done wrong or has encountered a similar problem.

Below is a fragment of a gene FPKM tracking file. The gene_short_name column contains ENST ids in the other files I checked, too.

tracking_id class_code nearest_ref_id gene_id gene_short_name
XLOC_000001 - - XLOC_000001  ENST00000450305,ENST00000456328,ENST00000515242,ENST00000518655
XLOC_000002 - - XLOC_000002 ENST00000469289,ENST00000473358,ENST00000607096
XLOC_000003 - - XLOC_000003 ENST00000594647,ENST00000606857
XLOC_000004 - - XLOC_000004 ENST00000492842
XLOC_000005 - - XLOC_000005 ENST00000335137
XLOC_000006 - - XLOC_000006 ENST00000442987
XLOC_000007 - - XLOC_000007 ENST00000496488
XLOC_000008 - - XLOC_000008 ENST00000419160,ENST00000423728,ENST00000425496,ENST00000431321,ENST00000431812,ENST00000432964,ENST00000440038,ENST00000440163,ENST00000445840,ENST00000453935,ENST00000455207,ENST00000455464,ENST00000514436,ENST00000599771,ENST00000601486,ENST00000601814,ENST00000608420
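
The symptom in the fragment above can be checked programmatically. The following is a minimal sketch, assuming the tracking file is tab-delimited with the header shown above; the function name and the `ENST` prefix test are illustrative choices, not part of any Cufflinks tool.

```python
import csv
import io

def short_names_look_like_transcripts(tracking_text, prefix="ENST"):
    """Return tracking_ids whose gene_short_name holds only transcript-style ids."""
    reader = csv.DictReader(io.StringIO(tracking_text), delimiter="\t")
    suspicious = []
    for row in reader:
        # gene_short_name may hold one id or a comma-separated list
        names = [n for n in row["gene_short_name"].split(",") if n]
        if names and all(n.startswith(prefix) for n in names):
            suspicious.append(row["tracking_id"])
    return suspicious

# Two rows copied from the fragment above, re-joined with tabs (an assumption
# about the real file layout).
sample = (
    "tracking_id\tclass_code\tnearest_ref_id\tgene_id\tgene_short_name\n"
    "XLOC_000004\t-\t-\tXLOC_000004\tENST00000492842\n"
    "XLOC_000003\t-\t-\tXLOC_000003\tENST00000594647,ENST00000606857\n"
)
flagged = short_names_look_like_transcripts(sample)
print(flagged)
```

If every gene in the file is flagged, the problem most likely lies in the annotation that was supplied rather than in individual rows.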

Any ideas on what might've gone wrong are very much welcome!

Monika

modified 3.7 years ago by Jennifer Hillman Jackson • written 3.7 years ago by m.maleszewska
3.7 years ago by
United States
Jennifer Hillman Jackson wrote:

Hello,

I can confirm that using the iGenomes version of the reference annotation will provide the attributes required by Cuffdiff and can help avoid identifier mismatch issues with reference genomes (when working on the public Main Galaxy instance at http://usegalaxy.org, and other Galaxies using UCSC reference genomes). For this one, be sure to use the version associated with hg19 - if that is your reference genome (it seems to be). Using the file with Tophat/2 and/or Cufflinks is optional, but do use it with CuffMerge before proceeding to Cuffdiff. 
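
Before re-running CuffMerge, it can be worth checking whether a given annotation file carries the attribute in question. The sketch below assumes the relevant attribute is `gene_name` in the ninth GTF column, which this thread does not state explicitly; the function names and the two sample records are made up for illustration.

```python
import re

# Matches GTF attribute pairs of the form: key "value";
ATTR_RE = re.compile(r'(\S+)\s+"([^"]*)"')

def parse_gtf_attributes(field):
    """Turn the 9th GTF column into a dict of attribute -> value."""
    return dict(ATTR_RE.findall(field))

def count_missing_gene_name(gtf_lines):
    """Return (records without gene_name, total records)."""
    total = missing = 0
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete GTF record
        total += 1
        if "gene_name" not in parse_gtf_attributes(cols[8]):
            missing += 1
    return missing, total

# Made-up records: the first has gene_name, the second does not.
with_name = 'chr1\tsrc\texon\t1\t100\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; gene_name "GENE_A";\n'
without_name = 'chr1\tsrc\texon\t1\t100\t.\t+\t.\tgene_id "T2"; transcript_id "T2";\n'
result = count_missing_gene_name([with_name, without_name])
print(result)
```

For a real file, pass an open file handle instead of the list, e.g. `count_missing_gene_name(open(path))`, where `path` is wherever the GTF was saved.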

The hg19 iGenomes file is available on the public Main Galaxy instance (http://usegalaxy.org) under: Shared Data -> Data Libraries -> iGenomes
(same as here: http://cufflinks.cbcb.umd.edu/igenomes.html)

For example protocols that make use of it, please see our wiki:
http://wiki.galaxyproject.org/Support#Tools_on_the_Main_server:_RNA-seq

Good luck with your project! Jen, Galaxy team


Thank you Jen! This is very helpful!

The file I found is however only ~100Mb, I hope that's correct. I will then try to use it with CuffMerge and then do CuffDiffs again.

Within CuffDiff, there is also an option to use a Reference sequence, which I normally set to Locally cached (it then automatically assigns hg19). I wondered whether I should supply a file there instead, and if so, what it should be? I have nothing in my History that even fits the format...

Best, Monika

modified 3.7 years ago • written 3.7 years ago by m.maleszewska

Glad that helped. Yes, for human on the public Main instance, you can use the local hg19 genome. There are actually a few variants of this dataset to pick from: Full (everything, including haplotype fragments and unmapped), Canonical (autosomes, X, Y, M), and Canonical Female (autosomes, X, M = avoids X/Y PAR alignment issues). Your choice - any will function correctly with this tool set, and downstream analysis tools (some tools will reassign the genome to just be "hg19", or you can do this directly, all are compatible). Best, Jen, Galaxy team

written 3.7 years ago by Jennifer Hillman Jackson
3.7 years ago by
United States
ahdee wrote:

Hi, I had the same issue but worked around it by using Excel's VLOOKUP function to cross-reference name and ID. So, for example:

 

=VLOOKUP(A31,gene_name5!A30:B61672,2,FALSE)

 

Here A31 is the ID, and the next argument points to a separate worksheet (named gene_name) in which the first column is the ID and the second is the gene name.
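
The same exact-match lookup can be done outside a spreadsheet. This is a minimal sketch, assuming a two-column ID-to-name table like the worksheet described above is already available; `build_lookup`, `lookup`, and the placeholder IDs are hypothetical. Unlike VLOOKUP, it also accepts a comma-separated list of IDs, as found in the gene_short_name column in this thread.

```python
def build_lookup(rows):
    """rows: iterable of (id, gene_name) pairs, like the two worksheet columns."""
    return {identifier: name for identifier, name in rows}

def lookup(field, table):
    """Exact-match lookup for one ID or a comma-separated list of IDs.

    IDs without a match are kept unchanged; duplicate names are collapsed.
    """
    names = []
    for identifier in field.split(","):
        name = table.get(identifier, identifier)
        if name not in names:
            names.append(name)
    return ",".join(names)

# Placeholder IDs and names, for illustration only.
table = build_lookup([
    ("ENST_A", "GENE_1"),
    ("ENST_B", "GENE_1"),
    ("ENST_C", "GENE_2"),
])
print(lookup("ENST_A,ENST_B", table))
print(lookup("ENST_C,ENST_X", table))
```

Keeping unmatched IDs visible, rather than dropping them, makes it easier to spot entries the mapping table does not cover.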

 

Ahdee

3.7 years ago by
Netherlands
m.maleszewska wrote:

Hi Ahdee,

Thank you for your tip!

I am worried, though, that if this is messed up, perhaps something else is wrong with my files that I don't see. Also, cummeRbund uses this column: if I make a heatmap, for example, it puts all the ENSTs along the side and it looks terrible (I can fix that by omitting the names, but it's not really a solution), and of course it takes additional time to sort things out.

What I do now is, for a chosen CuffGeneSet, use featureNames(CuffGeneSet) to retrieve the list of XLOC ids along with my ENSTs and then feed them to BioMart to get gene names. But the truth is that featureNames should give me gene names automatically, not ENSTs, if only they were in the right column!

Add to this all the downstream cummeRbund analyses that use that column, and I am just sorry to have to work around this when the files should already contain the names. And as I said, it makes me worried about whether the rest of the contents are all right.

I would appreciate any further feedback!

NB: I actually ran my analysis a few times on Galaxy, because I changed some parameters as I learnt more along the way, but every time this column looked weird (or rather, at first all I knew was that my heatmaps looked weird in cummeRbund, then I found out why :)). Any ideas on what could be the reason are very welcome!

Monika

3.7 years ago by
Netherlands
m.maleszewska wrote:

Similar discussion and perhaps a solution (I started it again here, because the fora did not seem connected, unless I'm wrong?): https://www.biostars.org/p/107585/#108172 .
