Question: Gene annotation file for Cufflinks
2.3 years ago
a.turtoi50 wrote:


I am struggling to find a proper annotation file for the human genome that includes the gene name (e.g. ACTB) and not some x#*! I need this when running Cufflinks, to annotate the genes from my TopHat file. When I go to the UCSF table webpage I can find the RefSeq annotation I need, but when I export it to Galaxy as GTF, only some columns are included, unfortunately excluding the gene name. If I custom export the file (from UCSF), I get what I want, but I cannot have it as a GTF file (obviously because the information in the custom exported file is not (fully) tab-delimited)!

Any idea?

Thanks! Andrei

Gene names aren't unique, which is why they aren't normally output. Why not use BioMart to import the ID->gene name conversion and annotate the Cufflinks results with that? This is how we normally teach people to do things in our trainings, since it doesn't break downstream analyses.
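To illustrate why gene names make poor keys, here is a minimal Python sketch that lists the names attached to more than one ID in an ID->gene name table. The function name and the identifiers are made-up placeholders, not real annotations.

```python
from collections import defaultdict


def names_with_multiple_ids(pairs):
    """Return {gene_name: sorted IDs} for names attached to more than one ID."""
    ids_by_name = defaultdict(set)
    for gene_id, gene_name in pairs:
        ids_by_name[gene_name].add(gene_id)
    return {
        name: sorted(ids)
        for name, ids in ids_by_name.items()
        if len(ids) > 1
    }


# Placeholder identifiers, for illustration only (not real annotations).
example_pairs = [
    ("ID_0001", "GENE_A"),
    ("ID_0002", "GENE_A"),
    ("ID_0003", "GENE_B"),
]

ambiguous = names_with_multiple_ids(example_pairs)
print(ambiguous)
```

Any name reported here would collapse several distinct records into one if it were used as the identifier, which is why the ID stays as the key and the name is added as an extra column.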

2.3 years ago
a.turtoi50 wrote:

Dear Ryan, Thank you very much for your response. Do you know of a tutorial that would explain the process you suggested to a common biologist? Thanks for your advice. Andrei.

I'd have to find where our training material is on GitHub. The general process is:

  1. Go to Ensembl BioMart
  2. Choose "Ensembl Genes 85", followed by the appropriate species
  3. Click on "Attributes" on the left
  4. In the center window, expand "Genes" and select "Associated Gene Name". If your GTF file is using something other than Ensembl IDs, select whatever that is (it might be under "EXTERNAL").
  5. Save that to a file (N.B., you used to be able to access BioMart from within Galaxy, but I'm not sure that works at the moment).
  6. Upload that file to Galaxy.
  7. Use the "Join two datasets" tool in Galaxy to join your results with the uploaded table on the shared ID column.

Something along those lines should work. Note that this same process can be used to annotate pretty much anything however you'd like.
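For anyone who prefers to do the join outside Galaxy, the same idea can be sketched in Python with only the standard library. This is a rough sketch, not the Galaxy tool's implementation: it assumes the BioMart export is a two-column, tab-separated file (ID, then gene name) with a header row, and that the results file has its ID in the first column. The function names, column names, and data below are hypothetical.

```python
import csv
import io


def load_id_to_name(mapping_file):
    """Read a two-column, tab-separated ID -> gene name table with a header."""
    reader = csv.reader(mapping_file, delimiter="\t")
    next(reader, None)  # skip the header row
    return {row[0]: row[1] for row in reader if len(row) >= 2}


def annotate(results_file, id_to_name, id_column=0):
    """Yield each results row with the gene name appended as a last column."""
    reader = csv.reader(results_file, delimiter="\t")
    header = next(reader, None)
    if header is None:
        return  # empty input, nothing to annotate
    yield header + ["gene_name"]
    for row in reader:
        # Rows whose ID is missing from the mapping are kept and marked "NA".
        yield row + [id_to_name.get(row[id_column], "NA")]


# Made-up stand-ins for the BioMart export and a tab-separated results file.
mapping_text = "id\tgene_name\nID_0001\tGENE_A\nID_0002\tGENE_B\n"
results_text = "test_id\tvalue\nID_0001\t1.5\nID_0003\t0.2\n"

id_to_name = load_id_to_name(io.StringIO(mapping_text))
rows = list(annotate(io.StringIO(results_text), id_to_name))
for row in rows:
    print("\t".join(row))
```

With real files, the `io.StringIO` objects would be replaced by files opened with `open(path, newline="")`. Unmatched rows are kept here rather than dropped, so the annotated table always has as many rows as the input.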

2.3 years ago
a.turtoi50 wrote:

Ryan, thank you again for your help. I have managed to get to step 4. I get the point that the two lists must have something unique in common in order to be matched. So looking at the RefSeq file I got from the UCSF website, only the exon start/end and start_codon start/end are the unique features by which these files can be matched. However, in Ensembl Genes 85, I cannot find exon or start_codon start/end. Any suggestions?

UCSC, not UCSF, which is a bit further north :)

Can you please post a few lines from the history item you're trying to annotate? It should contain the IDs you mentioned in your original post.

BTW, there's an "Add reply" button that you can click on rather than adding new answers.

Sorry, it's getting late here in Japan :)

Please see the few lines hereunder:

Seqname Source Feature Start End Score Strand Frame Attributes
chr1 hg19_refGene exon 66999252 66999355 0.000000 + . gene_id "NM_001308203"; transcript_id "NM_001308203";
chr1 hg19_refGene start_codon 67000042 67000044 0.000000 + . gene_id "NM_001308203"; transcript_id "NM_001308203";
chr1 hg19_refGene CDS 67000042 67000051 0.000000 + 0 gene_id "NM_001308203"; transcript_id "NM_001308203";
chr1 hg19_refGene exon 66999929 67000051 0.000000 + . gene_id "NM_001308203"; transcript_id "NM_001308203";
chr1 hg19_refGene CDS 67091530 67091593 0.000000 + 2 gene_id "NM_001308203"; transcript_id "NM_001308203";
chr1 hg19_refGene exon 67091530 67091593 0.000000 + . gene_id "NM_001308203"; transcript_id "NM_001308203";
chr1 hg19_refGene CDS 67098753 67098777 0.000000 + 1 gene_id "NM_001308203"; transcript_id "NM_001308203";

Sorry, but it is really difficult to paste the tab-delimited table properly.

Seqname Source Feature Start End Score Strand Frame Attributes
chr1 hg19_refGene exon 66999252 66999355 0.000000 + . gene_id "NM_001308203";
chr1 hg19_refGene start_codon 67000042 67000044 0.000000 + . gene_id "NM_001308203";
chr1 hg19_refGene CDS 67000042 67000051 0.000000 + 0 gene_id "NM_001308203";

That's just a GTF file; annotating it with gene names isn't going to be terribly useful. What you want to annotate is the output of a test, which will be a tab-separated file.
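To see which kind of ID a GTF file carries, and therefore which BioMart attribute to export, the ninth (attributes) column can be parsed. Below is a minimal Python sketch, assuming tab-separated GTF lines with `key "value";` attributes as in the lines posted above; the function name is hypothetical and the regular expression is a simplification that does not handle every attribute style.

```python
import re

# Matches attribute pairs of the form: key "value";
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)";')


def parse_gtf_line(line):
    """Split one GTF line into its 8 fixed columns and a dict of attributes."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated GTF columns")
    attributes = dict(ATTRIBUTE.findall(fields[8]))
    return fields[:8], attributes


# First record from the lines posted above, with the tabs restored.
line = (
    "chr1\thg19_refGene\texon\t66999252\t66999355\t0.000000\t+\t.\t"
    'gene_id "NM_001308203"; transcript_id "NM_001308203";'
)

columns, attributes = parse_gtf_line(line)
print(attributes["gene_id"])
```

The `NM_` prefix marks a RefSeq mRNA accession, so for this file the BioMart export would need a RefSeq identifier column rather than Ensembl gene IDs.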

Thank you! I will try, and if I encounter difficulties I will post back.

