Gene annotation file for Cufflinks

2.3 years ago by

a.turtoi • 50 wrote:

Hi,

I am struggling to find a proper annotation file for the human genome in which the gene name (e.g. ACTB) is included and not some x#*! I need this when running Cufflinks, to annotate the genes from my TopHat file. When I go to the UCSF table webpage I can find the RefSeq annotation I need, but when I export it to Galaxy as GTF, only some columns are included, unfortunately excluding the gene name. If I custom export the file (from UCSF), I get what I want, but I cannot have it as a GTF file (obviously because the information in the custom exported file is not (fully) tab-delimited)!

Any idea?

Thanks! Andrei

rna-seq cufflinks • 765 views

modified 2.3 years ago • written 2.3 years ago by a.turtoi • 50

Gene names aren't unique, which is why they aren't normally output. Why not use BioMart to import the ID -> gene name conversion and annotate the Cufflinks results with that? This is how we normally teach people to do things in our trainings, since it doesn't break downstream analyses.

ADD REPLY • link written 2.3 years ago by Devon Ryan • 1.9k

2.3 years ago by

a.turtoi • 50 wrote:

Dear Ryan, Thank you very much for your response. Do you know of a tutorial that would explain the process you suggested to a common biologist? Thanks for your advice. Andrei.

ADD COMMENT • link written 2.3 years ago by a.turtoi • 50

I'd have to find where our training material is on github. The general process is:

1. Go to Ensembl biomart
2. Choose "Ensembl Genes 85", followed by the appropriate species
3. Click on "Attributes" on the left
4. In the center window, expand "Genes" and select "Associated Gene Name". If your GTF file is using something other than Ensembl IDs, select whatever that is (it might be under "EXTERNAL").
5. Save that to a file (N.B., you used to be able to access biomart from within Galaxy, but I'm not sure that works at the moment).
6. Upload that file.
7. Use the "Join two datasets" tool in Galaxy.

Something along those lines should work. Note that this same process can be used to annotate pretty much anything however you'd like.
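
Outside Galaxy, the same join can be sketched in a few lines of Python. This is only an illustration of the idea, not the Galaxy tool itself: the function names (`load_id_to_name`, `annotate`), the column layout, and the sample IDs and gene names are all made up for the example. It assumes the BioMart export is a two-column, tab-separated file with a header (ID first, gene name second) and that the results table carries the ID in its first column.

```python
import csv
import io

def load_id_to_name(handle):
    """Read a two-column, tab-separated ID -> gene name table (header skipped)."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header line
    mapping = {}
    for row in reader:
        if len(row) >= 2 and row[0]:
            mapping[row[0]] = row[1]
    return mapping

def annotate(results_handle, mapping, id_column=0, missing="NA"):
    """Yield each result row with the gene name appended as a last column."""
    reader = csv.reader(results_handle, delimiter="\t")
    header = next(reader)
    yield header + ["gene_name"]
    for row in reader:
        if not row:
            continue  # ignore blank lines
        # IDs without a match keep their row and get a placeholder name
        yield row + [mapping.get(row[id_column], missing)]

# Hypothetical in-memory stand-ins for the two uploaded files
biomart_export = io.StringIO("ID\tGene name\nID_A\tGENE1\nID_B\tGENE2\n")
results = io.StringIO("test_id\tvalue\nID_A\t1.5\nID_C\t0.2\n")

mapping = load_id_to_name(biomart_export)
annotated = list(annotate(results, mapping))
for row in annotated:
    print("\t".join(row))
```

Because the join is keyed on the ID rather than on the gene name, the original ID column is left untouched and downstream steps that rely on it keep working.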

ADD REPLY • link written 2.3 years ago by Devon Ryan • 1.9k

2.3 years ago by

a.turtoi • 50

a.turtoi • 50 wrote:

Ryan, thank you again for your help. I have managed to get to step 4. I understand that the two lists must share a unique identifier in order to be matched. Looking at the RefSeq file I got from the UCSF website, only the exon Start/End and start_codon Start/End seem to be unique features by which these files can be matched. However, in Ensembl Genes 85, I cannot find exon or start_codon start/end. Any suggestions?

ADD COMMENT • link written 2.3 years ago by a.turtoi • 50

UCSC, not UCSF, which is a bit further north :)

Can you please post a few lines from the history item you're trying to annotate? It should contain the IDs you mentioned in your original post.

BTW, there's an "Add reply" button that you can click on rather than adding new answers.

ADD REPLY • link written 2.3 years ago by Devon Ryan • 1.9k

Sorry, it's getting late here in Japan :)

Please see the few lines hereunder:

Seqname Source Feature Start End Score Strand Frame Attributes
chr1 hg19_refGene exon 66999252 66999355 0.000000 + . gene_id "NM_001308203"; transcript_id "NM_001308203";
chr1 hg19_refGene start_codon 67000042 67000044 0.000000 + . gene_id "NM_001308203"; transcript_id "NM_001308203";
chr1 hg19_refGene CDS 67000042 67000051 0.000000 + 0 gene_id "NM_001308203"; transcript_id "NM_001308203";
chr1 hg19_refGene exon 66999929 67000051 0.000000 + . gene_id "NM_001308203"; transcript_id "NM_001308203";
chr1 hg19_refGene CDS 67091530 67091593 0.000000 + 2 gene_id "NM_001308203"; transcript_id "NM_001308203";
chr1 hg19_refGene exon 67091530 67091593 0.000000 + . gene_id "NM_001308203"; transcript_id "NM_001308203";
chr1 hg19_refGene CDS 67098753 67098777 0.000000 + 1 gene_id "NM_001308203"; transcript_id "NM_001308203";

ADD REPLY • link written 2.3 years ago by a.turtoi • 50

Sorry, but it is really difficult to paste the tab-delimited table properly.

Seqname Source Feature Start End Score Strand Frame Attributes
chr1 hg19_refGene exon 66999252 66999355 0.000000 + . gene_id "NM_001308203";
chr1 hg19_refGene start_codon 67000042 67000044 0.000000 + . gene_id "NM_001308203";
chr1 hg19_refGene CDS 67000042 67000051 0.000000 + 0 gene_id "NM_001308203";

ADD REPLY • link modified 2.3 years ago • written 2.3 years ago by a.turtoi • 50

That's just a GTF file; annotating it with gene names isn't going to be terribly useful. What you want to annotate is the output of a test, which will be a tab-separated file.
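
To see which kind of ID a GTF file carries (and so which BioMart attribute to pick for the join), you can pull the `gene_id` and `transcript_id` out of the ninth column. The snippet below is a rough sketch, assuming a tab-separated line with quoted attribute values; `parse_gtf_line` is a made-up helper name, and the sample line is one of the records pasted above with the tabs restored.

```python
import re

# Matches attribute pairs of the form: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')

def parse_gtf_line(line):
    """Split one GTF line into its 8 fixed columns and a dict of attributes."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(fields))
    attrs = dict(ATTR_RE.findall(fields[8]))
    return fields[:8], attrs

# One record from the thread, with tabs restored (an assumption about the original file)
line = ('chr1\thg19_refGene\texon\t66999252\t66999355\t0.000000\t+\t.\t'
        'gene_id "NM_001308203"; transcript_id "NM_001308203";')

columns, attrs = parse_gtf_line(line)
print(columns[2], attrs["gene_id"], attrs["transcript_id"])
```

In this file both attributes hold the same accession, so there is no separate gene-level identifier to read off the GTF itself.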

ADD REPLY • link written 2.3 years ago by Devon Ryan • 1.9k

Thank you! I will try, and if I encounter difficulties I will post back.

ADD REPLY • link written 2.3 years ago by a.turtoi • 50
