Batch conversion of ID to gene symbol

I'm an old-school molecular biologist who studies gene expression but is quite new to bio-computing. I am feeling my way around some ENCODE siRNA-RNA-seq data through web interfaces in GenomeSpace (like Galaxy, thanks!). I managed to do an analysis of differential gene expression between 2 control and 2 siRNA replicates (Cuffdiff). I got 50K-ish genes with UCSC gene IDs, but couldn't manage to convert these to the gene symbols/names I am more familiar with using the tools a Google search pointed me to. I re-ran the analysis with Ensembl genes and then RefSeq genes to see what would change, and to see if this helped me retrieve gene symbols. I got more genes from Ensembl (huh?) and otherwise, unfortunately, all I got was bupkis (meaning Cuffdiff worked and the most meaningful changes showed up in all conditions, but there were still no gene symbol entries in any of the outputs).

When I cut and paste any of these gene IDs/Ensembl IDs into a Google/PubMed search, they easily locate the associated genes, but I want to convert the entire list to gene symbols, not go through them one by one. A different Google search pointed me to some tools that seemed designed for that purpose (BioMart, UCSC Table Browser, DAVID), but after fumbling around I surmised that these tools don't/can't convert 50K genes at once and, to complicate the task, that a good proportion of those IDs have no proper gene symbol. When I used a much smaller list (100-500 gene IDs) I was able to get some conversions; however, the output didn't correspond to the list I entered: the results were not in the order I entered them and there were fewer/more entries than I entered, making merging them with my original gene list problematic/impossible without manually correlating all of them (exactly what I am trying to avoid).

I NEED ADVICE: Am I going about this all wrong? Is converting large lists of gene IDs to symbols not possible, or naive? If it is naive, then what do people in my position normally do? If it is possible, how do I get to the gene symbols for a gene expression analysis if the original output doesn't include them? If I am using the right tools, then how do I put in a list of genes and get back a one-to-one correspondence of gene symbols in the order I entered them, with a skipped space where a particular gene ID has no corresponding gene symbol?

thanks

rna-seq • 4.8k views

modified 3.1 years ago by guangchuangyu • 0 • written 3.1 years ago by Kevin.Czaplinski • 0

Jennifer Hillman Jackson ♦ 25k wrote:

Hello,

There are two solutions for this: the first is tool-specific and the other is general bioinformatics.

1. The reference annotation GTF used with Cuffdiff could include the attribute gene_name. This would incorporate the label directly into the output files. One source of annotation files with this attribute (plus the p_id and tss_id necessary for full differential expression) is iGenomes. This is the quickest way to incorporate the information, for this pipeline.

2. For any data, any two-column file that contains the transcript identifiers used in the original reference annotation file, along with the corresponding gene name, gene symbol, or other identifier you might want to use, can be used with the tool "Join two Datasets side by side on a specified field". The idea is to enter the output analysis file you have, keyed on one of its values (presumably the transcripts), and then also enter the mapping file. The tool "adds" the extra field containing the gene name/symbol to the first file. Blank values can be skipped or filled in with a placeholder value (to preserve column ordering).

If using the RefSeq Genes track, the gene name can be obtained from UCSC from the track's primary table, refGene (the value for the gene name is in the field called "name2"). If using the UCSC Genes track, the gene name is in the related table kgXref (this file will be a bit more complicated since it has a many-to-many data relationship, but it can be filtered). BioMart is another example data source, and the transcript/gene name output can be custom selected and sent back to Galaxy easily for use with tools. Data sources almost always have a way to map from transcripts to gene name/symbol through a file.
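For readers who prefer a script to the Galaxy Join tool, the same idea (keep every row of the analysis file, in its original order, and append the symbol or a placeholder) can be sketched in plain Python. This is only an illustration under stated assumptions: both files are tab-separated, the identifier is in the first column of each, and the identifiers and symbols below (`tx0001`, `GENEA`, and so on) are made up.

```python
import csv
import io

def load_mapping(handle, id_col=0, symbol_col=1):
    """Read a tab-separated mapping file into a dict of ID -> symbol.

    If an ID appears more than once, the first symbol seen is kept.
    """
    mapping = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) <= max(id_col, symbol_col):
            continue  # skip blank or short lines
        mapping.setdefault(row[id_col], row[symbol_col])
    return mapping

def annotate(rows, mapping, id_col=0, placeholder="NA"):
    """Yield every input row, in its original order, with a symbol appended.

    Rows whose ID is not in the mapping get the placeholder instead,
    so the output always has exactly one row per input row.
    """
    for row in rows:
        key = row[id_col] if len(row) > id_col else None
        yield row + [mapping.get(key, placeholder)]

# Hypothetical data: in practice these would be opened files.
mapping_text = "tx0001\tGENEA\ntx0002\tGENEB\n"
results_text = "tx0002\t2.5\ntx9999\t0.1\ntx0001\t-1.2\n"

mapping = load_mapping(io.StringIO(mapping_text))
rows = list(csv.reader(io.StringIO(results_text), delimiter="\t"))
annotated = list(annotate(rows, mapping))

for row in annotated:
    print("\t".join(row))
```

Because the lookup is driven by the analysis file rather than by the mapping file, the output cannot gain, lose, or reorder rows, which is the behaviour asked for in the original question.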

Good luck with your project! Jen, Galaxy team

written 3.1 years ago by Jennifer Hillman Jackson ♦ 25k

Thank you Jennifer. Re option 1: At some point, I realized the GTF files I got from UCSC didn't have the gene names, although from what I could see in the schemas they should have. Upon your suggestion I just tried the RefSeq track again and it didn't include the name2 field in the GTF file that I exported to Galaxy. From the UCSC Genes track, when I select kgXref, exporting the file as a GTF file doesn't come up as an option (it seems I need that type for the Cuffdiff input), and the file that is produced (tabular) has the field I want but doesn't work with the Cuffdiff input window. Can I convert that tabular file to a GTF file?

The transfer of the file from iGenomes is going painfully slow and it might be quite a while until I can try that one.

Still working on option 2....

written 3.1 years ago by Kevin.Czaplinski • 0

Update: I took the tabular output file that I retrieved from UCSC, and editing the metadata seemed to allow me to convert it to a GTF file. Cuffdiff accepted that file, but there was essentially no output from that analysis, so I suppose that didn't work. Still stumped.....

written 3.1 years ago by Kevin.Czaplinski • 0

The UCSC GTF file does not contain the p_id, tss_id, or gene_name fields. It will also have gene_id and transcript_id set to the same value. So, full Cuffdiff functionality is not possible. This is not UCSC's problem. It just means that the data is not a match for the Tuxedo suite. This is why the iGenomes datasets were created.
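Before starting a long Cuffdiff run, it can be worth checking which of these attributes a GTF file actually carries. The following is a minimal sketch, not part of Galaxy or Cufflinks: it assumes the usual nine tab-separated GTF columns with attributes written as `key "value";` pairs in the ninth, and the function name `audit_gtf` and the sample records are hypothetical.

```python
import re
from collections import Counter

# Matches attribute pairs of the form: key "value"
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')
WANTED = ("gene_name", "tss_id", "p_id")

def audit_gtf(lines):
    """Count how many GTF records carry each attribute of interest,
    and how many have gene_id identical to transcript_id."""
    counts = Counter()
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) < 9:
            continue  # skip comments, blank lines, malformed records
        attrs = dict(ATTR_RE.findall(cols[8]))
        counts["records"] += 1
        for key in WANTED:
            if key in attrs:
                counts["has_" + key] += 1
        gene_id = attrs.get("gene_id")
        if gene_id is not None and gene_id == attrs.get("transcript_id"):
            counts["gene_id_equals_transcript_id"] += 1
    return counts

def make_line(attributes):
    """Build a made-up GTF exon record around the given attribute string."""
    return "\t".join(["chr1", "example", "exon", "100", "200", ".", "+", ".", attributes])

sample = [
    make_line('gene_id "T1"; transcript_id "T1";'),
    make_line('gene_id "G2"; transcript_id "T2"; gene_name "GENEB"; tss_id "TSS1";'),
]
report = audit_gtf(sample)
for key in sorted(report):
    print(key, report[key])
```

With a real file, the lines would come from `open(path)` rather than from the sample list. A file in which `has_gene_name` is zero and every record has `gene_id` equal to `transcript_id` matches the UCSC export described above.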

UCSC, BioMart, and other genomic data providers are fallback options. Each probably contains the data you want somewhere in its data structure (meaning it is directly there or can be calculated from what is there), and that data can be extracted and formatted for the specific use (GTF for use in Cuffdiff, tabular file for use with Join, other uses).

There are no automatic tools in Galaxy to do this. Instead, individual tools need to be used together to perform the manipulation. And it will be a different workflow for each source. This can be done, and is good to know how to do, but maybe use iGenomes first and see if that is enough.

A tool that converts common identifiers directly to gene name or symbol (as a list) would be useful. But whatever it uses for reference would have to be updated (daily, in the case of RefSeq), plus you would probably lose the content of the rest of the file unless keeping it is a specific option.

By contrast, using a tabular file you provide with these two values mapped - and any analysis file you are already working with - is a better method if the goal is to preserve the original content and to use whatever version of the transcript->symbol mapping you want. This is why I brought it up. This is an extremely valuable bioinformatics topic to understand. Knowing how to manipulate data to suit your needs is a powerful advantage. Galaxy has modular tools designed just for this purpose. Put these in a workflow and you have a custom tool. Share that workflow with others and you are essentially a tool author. No programming needed :)

Good luck with this, I think you are close. If you are still confused about how the files are used differently and why, go back and review the file differences and what each is useful for (Cuffdiff versus Join). But follow-up is also OK :)

modified 3.1 years ago • written 3.1 years ago by Jennifer Hillman Jackson ♦ 25k

The human hg19 version of the iGenomes dataset is on http://usegalaxy.org under "Shared Data -> Data Libraries -> iGenomes". It is the "genes.gtf" file from the complete iGenomes tar bundle you are transferring. Maybe try this one out in Cuffdiff as a test, to see if it does what you want?

The inputs for Join are tabular datasets. The output from Cuffdiff is in a version of tabular format. GTF is technically tabular format, but not a great choice for this operation since the data you want to work with (if present at all) is in the 9th attribute field, mixed with other data, and not in a format that will permit value matching.
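If a GTF file does carry gene_name, one way around the mixed 9th field is to pull the two values out into a plain two-column table first and join on that. The sketch below shows the idea in Python under the assumption that attributes are written as `key "value";` pairs; the helper names and the sample records are hypothetical.

```python
import re

# Matches attribute pairs of the form: key "value"
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')

def parse_attributes(field):
    """Turn 'gene_id "X"; gene_name "Y";' into {'gene_id': 'X', 'gene_name': 'Y'}."""
    return dict(ATTR_RE.findall(field))

def transcript_to_gene_name(gtf_lines):
    """Return {transcript_id: gene_name} from GTF records.

    Transcripts without a gene_name attribute map to an empty string.
    A GTF file repeats each transcript on many lines (one per exon),
    so only the first occurrence is recorded.
    """
    table = {}
    for line in gtf_lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) < 9:
            continue  # skip comments, blank lines, malformed records
        attrs = parse_attributes(cols[8])
        transcript = attrs.get("transcript_id")
        if transcript is not None:
            table.setdefault(transcript, attrs.get("gene_name", ""))
    return table

def make_line(attributes):
    """Build a made-up GTF exon record around the given attribute string."""
    return "\t".join(["chr1", "example", "exon", "100", "200", ".", "+", ".", attributes])

sample = [
    make_line('gene_id "G1"; transcript_id "T1"; gene_name "GENEA";'),
    make_line('gene_id "G1"; transcript_id "T1"; gene_name "GENEA";'),
    make_line('gene_id "T2"; transcript_id "T2";'),
]
table = transcript_to_gene_name(sample)
for transcript, name in table.items():
    print(transcript + "\t" + name)
```

The printed two-column output is the kind of tabular mapping file that Join expects.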

For the kgXref file, export this as "all fields from the selected table", then go in and limit which fields are actually printed out. The same goes for RefSeq genes (or any other gene track). Start with the primary table, link in associated tables, check the boxes for the fields of interest (the transcript and the alternate gene name/symbol), then export that to Galaxy. If this seems complicated (and it might be the first time), follow the Table Browser tutorials and instructions. BioMart has the same concept for data extraction, just implemented differently (this site also has help to guide you).
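A mapping exported this way can list the same identifier on more than one row, which is what makes a joined result come back with more rows than the original list. One possible way to filter it, sketched here in Python as an assumption rather than as the method Galaxy uses, is to collapse the export to one row per identifier before joining. The function name and the (identifier, symbol) pairs below are made up.

```python
from collections import defaultdict

def collapse_mapping(pairs, sep=";"):
    """Collapse (identifier, symbol) pairs to one entry per identifier.

    Duplicate symbols are dropped, distinct symbols are joined with `sep`
    in the order first seen, and identifiers that only ever appear with
    an empty symbol are kept with an empty string.
    """
    symbols_by_id = defaultdict(list)
    for ident, symbol in pairs:
        seen = symbols_by_id[ident]  # registers the identifier even if symbol is empty
        if symbol and symbol not in seen:
            seen.append(symbol)
    return {ident: sep.join(symbols) for ident, symbols in symbols_by_id.items()}

# Hypothetical rows, as if read from a two-column tabular export.
pairs = [
    ("tx1", "GENEA"),
    ("tx1", "GENEA"),   # exact duplicate
    ("tx2", "GENEB"),
    ("tx2", "GENEC"),   # same identifier, second symbol
    ("tx3", ""),        # identifier with no symbol
]
collapsed = collapse_mapping(pairs)
for ident, symbols in collapsed.items():
    print(ident + "\t" + symbols)
```

Joining distinct symbols with a separator keeps the ambiguity visible instead of silently picking one; keeping only the first symbol would be an equally reasonable choice for some analyses.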

If all of this seems new, then you can either experiment within Galaxy to manipulate the files extracted from public sources (get all the data, and avoid formats like GTF since these will not be useful for your purposes) or try using the iGenomes dataset directly with Cuffdiff. Even if all of this seems too technical, I still encourage you to use Galaxy as a way to learn these relational concepts in a UI setting, for this analysis or later on.

I am leaving a lot of detail out, but I think that is best.

written 3.1 years ago by Jennifer Hillman Jackson ♦ 25k

I have no problem experimenting; that usually works best for me. It was a lot simpler with Galaxy, to be sure. Your clarifications have been helpful, thanks. I was able to join my output with one of the tabular files that I obtained from UCSC that had matching identifiers (your option 2 in the first reply). That was quick and painless once I clicked on the "join two datasets" tool under "join, subtract and group". There were 2 iGenomes GTF files on the Galaxy site; the first one didn't work in Cuffdiff (accepted but gave no output). The other file seems to be working (running as I write this), but with 840K lines in this file it is at least a 10 times bigger job than my prior runs, so who knows when it will be done.

written 3.1 years ago by Kevin.Czaplinski • 0
