Question: Gene annotation file for Cufflinks
a.turtoi wrote:

Hi,

I am struggling to find a proper annotation file for the human genome in which the gene name (e.g. ACTB) is included, and not some x#*! I need this when running Cufflinks, to annotate the genes from my TopHat file. When I go to the UCSF table webpage I can find the RefSeq annotation I need, but when I export it to Galaxy as GTF, only some columns are included, unfortunately excluding the gene name. If I custom-export the file (from UCSF), I get what I want, but I cannot have it as a GTF file (obviously because the information in the custom-exported file is not (fully) tab-delimited)!

Any idea?

Thanks! Andrei

rna-seq cufflinks • 765 views
modified 2.3 years ago • written 2.3 years ago by a.turtoi

Gene names aren't unique, which is why they aren't normally output. Why not use biomart to import the ID->Gene name conversion and annotate the cufflinks results with that? This is how we normally teach people to do things in our trainings, since it doesn't break downstream analyses.
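To make the non-uniqueness point concrete, here is a minimal sketch that lists gene names attached to more than one ID in an ID-to-name table. The function name and the IDs and names in the example are made-up placeholders, not real annotations.

```python
from collections import defaultdict


def names_with_multiple_ids(pairs):
    """Return {gene_name: sorted IDs} for every name that maps to more than one ID."""
    ids_by_name = defaultdict(set)
    for gene_id, gene_name in pairs:
        ids_by_name[gene_name].add(gene_id)
    return {
        name: sorted(ids)
        for name, ids in ids_by_name.items()
        if len(ids) > 1
    }


# Placeholder (ID, name) pairs; in practice these would come from an exported table.
pairs = [("ID_A", "GENE1"), ("ID_B", "GENE2"), ("ID_C", "GENE1")]
ambiguous = names_with_multiple_ids(pairs)
print(ambiguous)  # {'GENE1': ['ID_A', 'ID_C']}
```

If a table keyed by gene name were used for a join, each such name would match several rows, which is why keeping the stable ID as the key and adding the name as an extra column is the safer route.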

written 2.3 years ago by Devon Ryan
a.turtoi wrote:

Dear Ryan, Thank you very much for your response. Do you know of a tutorial that would explain the process you suggested to a common biologist? Thanks for your advice. Andrei.

written 2.3 years ago by a.turtoi

I'd have to find where our training material is on github. The general process is:

  1. Go to Ensembl biomart
  2. Choose "Ensembl Genes 85", followed by the appropriate species
  3. Click on "Attributes" on the left
  4. In the center window, expand "Genes" and select "Associated Gene Name". If your GTF file is using something other than Ensembl IDs, select whatever that is (it might be under "EXTERNAL").
  5. Save that to a file (N.B., you used to be able to access biomart from within Galaxy, but I'm not sure that works at the moment).
  6. Upload that file.
  7. Use the "Join two datasets" tool in Galaxy.

Something along those lines should work. Note that this same process can be used to annotate pretty much anything however you'd like.
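Outside Galaxy, the join in the last step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the mapping file is assumed to be a two-column, tab-separated export (ID, then gene name) with a header row, the results table is assumed to carry the ID in its first column, and all function names, IDs and values below are made-up placeholders.

```python
import csv
import io


def load_id_to_name(mapping_file):
    """Read a two-column, tab-separated ID -> gene name table that has a header row."""
    reader = csv.reader(mapping_file, delimiter="\t")
    next(reader, None)  # skip the header row
    id_to_name = {}
    for row in reader:
        if len(row) >= 2 and row[0]:
            id_to_name[row[0]] = row[1]
    return id_to_name


def annotate(results_file, id_to_name, id_column=0, missing="NA"):
    """Yield each row of a tab-separated results table with a gene name appended."""
    reader = csv.reader(results_file, delimiter="\t")
    header = next(reader)
    yield header + ["gene_name"]
    for row in reader:
        # Rows whose ID has no entry in the mapping are kept and marked as missing.
        yield row + [id_to_name.get(row[id_column], missing)]


# Placeholder data standing in for the exported mapping and a results table.
mapping = io.StringIO("ID\tname\nID_A\tGENE1\nID_B\tGENE2\n")
results = io.StringIO("test_id\tvalue\nID_A\t1.5\nID_C\t0.2\n")

annotated = list(annotate(results, load_id_to_name(mapping)))
for row in annotated:
    print("\t".join(row))
```

The sketch keeps rows that have no match and fills in "NA", so no result is silently dropped; the original ID column stays in place, which keeps the table usable for downstream steps.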

written 2.3 years ago by Devon Ryan
a.turtoi wrote:

Ryan, thank you again for your help. I have managed to get to step 4. I get the point that the two lists must have something unique in common in order to be matched. Looking at the RefSeq file I got from the UCSF website, only the exon Start/End and start_codon Start/End are unique features by which these files could be matched. However, in Ensembl Genes 85, I cannot find exon or start_codon start/end. Any suggestions?

written 2.3 years ago by a.turtoi

UCSC, not UCSF, which is a bit further north :)

Can you please post a few lines from the history item you're trying to annotate? It should contain the IDs you mentioned in your original post.

BTW, there's an "Add reply" button that you can click on rather than adding new answers.

written 2.3 years ago by Devon Ryan

sorry, it's getting late here in Japan :)

Please see the few lines hereunder:

    Seqname Source Feature Start End Score Strand Frame Attributes
    chr1 hg19_refGene exon 66999252 66999355 0.000000 + . gene_id "NM_001308203"; transcript_id "NM_001308203";
    chr1 hg19_refGene start_codon 67000042 67000044 0.000000 + . gene_id "NM_001308203"; transcript_id "NM_001308203";
    chr1 hg19_refGene CDS 67000042 67000051 0.000000 + 0 gene_id "NM_001308203"; transcript_id "NM_001308203";
    chr1 hg19_refGene exon 66999929 67000051 0.000000 + . gene_id "NM_001308203"; transcript_id "NM_001308203";
    chr1 hg19_refGene CDS 67091530 67091593 0.000000 + 2 gene_id "NM_001308203"; transcript_id "NM_001308203";
    chr1 hg19_refGene exon 67091530 67091593 0.000000 + . gene_id "NM_001308203"; transcript_id "NM_001308203";
    chr1 hg19_refGene CDS 67098753 67098777 0.000000 + 1 gene_id "NM_001308203"; transcript_id "NM_001308203";

written 2.3 years ago by a.turtoi

Sorry, but it is really difficult to paste the tab-delimited table properly.

    Seqname Source Feature Start End Score Strand Frame Attributes
    chr1 hg19_refGene exon 66999252 66999355 0.000000 + . gene_id "NM_001308203";
    chr1 hg19_refGene start_codon 67000042 67000044 0.000000 + . gene_id "NM_001308203";
    chr1 hg19_refGene CDS 67000042 67000051 0.000000 + 0 gene_id "NM_001308203";

modified 2.3 years ago • written 2.3 years ago by a.turtoi

That's just a GTF file; annotating it with gene names isn't going to be terribly useful. What you want to annotate is the output of a test, which will be a tab-separated file.
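For reference, the identifier that ends up in downstream results is the one stored in the ninth (attributes) column of the GTF, not the exon or start_codon coordinates. A minimal sketch of pulling it out of one of the lines posted above follows; the function name and the returned dictionary layout are illustrative choices, and the regular expression assumes every attribute is written as `key "value";`.

```python
import re

# Matches attribute entries of the form: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def parse_gtf_line(line):
    """Split one tab-separated GTF line and parse its attributes column into a dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated fields, got %d" % len(fields))
    return {
        "seqname": fields[0],
        "feature": fields[2],
        "start": int(fields[3]),
        "end": int(fields[4]),
        "strand": fields[6],
        "attributes": dict(ATTR_RE.findall(fields[8])),
    }


# First data line from the post above, with the tabs restored.
line = (
    "chr1\thg19_refGene\texon\t66999252\t66999355\t0.000000\t+\t.\t"
    'gene_id "NM_001308203"; transcript_id "NM_001308203";'
)
record = parse_gtf_line(line)
print(record["attributes"]["gene_id"])  # NM_001308203
```

The `NM_` prefix marks a RefSeq mRNA accession, so the matching column to export from biomart would be a RefSeq mRNA ID rather than an Ensembl gene ID.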

written 2.3 years ago by Devon Ryan

Thank you! I will try and if I encounter difficulties I will be posting back.

written 2.3 years ago by a.turtoi