Question: How To Transfer Gene Id Into Protein Id
7.1 years ago by Li, Jilong (MU-Student)
Hi, I have some refseq gene ids, like NM_*****. How can I convert these gene ids into protein ids, like NP_****? Thank you very much! Victor
7.1 years ago, Jennifer Hillman Jackson wrote:
Hello,

If the reference genome is in UCSC and has a RefSeq track, then you can extract a table called "refLink" from the Table Browser, which holds the transcript and protein identifiers, and subset it for the rows that match your query RefSeq transcript identifiers. If the RefSeq data is at BioMart or another source, a similar path to the one I outline below will work with some modifications. It all depends on the file format, but Galaxy's tools can manipulate data in just about every way you will need.

Using a transcript identifier query, subset protein identifiers in a UCSC RefSeq track:

A. Load your list of NM* identifiers ("Get Data -> Upload").
   - Set the file format to "tabular" (use the "pencil" icon to "Edit Attributes -> Change data type") if needed.

B. Load the RefSeq id mapping data with "Get Data -> UCSC Main" and set the form parameters as needed, choosing the track "RefSeq Genes" and the table "refLink". Make sure the region is the entire genome. Send to Galaxy formatted as-is (tabular).

C. Next, cut columns 3 and 4 (the transcript and protein identifiers) out of the table with the tool "Text Manipulation -> Cut" and the options "c3,c4".

D. OPTIONAL, if you want the full list of coding RefSeqs for another purpose: remove the non-coding RefSeqs with the tool "Filter and Sort -> Select" and the options "that: NOT Matching" and "the pattern: ^NR_.*$". Be sure to enter the regular expression '^NR_.*$' without any quotes.

E. Perform a join using "Join, Subtract and Group -> Compare two Datasets" with the options:
   - "Compare: <file of trans and prot id, filtered or not>"
   - "Using column: c1" where c1 is the trans ids
   - "against: <file of trans ids>"
   - "and column: c1" where c1 is the trans ids
   - "To find: Matching rows of first dataset"

F. The result dataset is a two-column tabular file: transcript id <tab> protein id

Hopefully this helps you and others who are doing a similar task. If you think you will be doing this a lot, be sure to consider extracting the steps into a workflow.
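For anyone who would rather script the same subset outside Galaxy, here is a minimal sketch in Python. It is not part of the Galaxy recipe. It assumes the refLink table has been saved as tab-separated text with the transcript accession in column 3 and the protein accession in column 4 (the same columns as the "c3,c4" cut). The function name and the example rows are made up for illustration; the accessions are placeholders, not real records.

```python
import csv
import io


def map_transcripts_to_proteins(reflink_rows, query_ids):
    """Return {transcript_id: protein_id} for query IDs found in refLink rows.

    reflink_rows: iterable of lists, one list of fields per table row.
    Assumes field 3 is the transcript accession and field 4 the protein
    accession (an assumption mirroring the "c3,c4" cut step).
    """
    wanted = set(query_ids)
    mapping = {}
    for row in reflink_rows:
        if len(row) < 4:
            continue  # skip short or malformed rows
        transcript, protein = row[2], row[3]
        # Keep only queried transcripts that actually have a protein accession.
        if transcript in wanted and protein:
            mapping[transcript] = protein
    return mapping


# Tiny made-up table; a real refLink export has more columns than this.
example = (
    "geneA\tproduct A\tNM_000001\tNP_000001\n"
    "geneB\tproduct B\tNR_000002\t\n"
    "geneC\tproduct C\tNM_000003\tNP_000003\n"
)
rows = list(csv.reader(io.StringIO(example), delimiter="\t", quoting=csv.QUOTE_NONE))
result = map_transcripts_to_proteins(rows, ["NM_000001", "NR_000002"])

# Print the two-column result: transcript id <tab> protein id
for transcript, protein in sorted(result.items()):
    print(f"{transcript}\t{protein}")
```

Rows with an empty protein field are skipped, so a queried transcript with no protein accession simply does not appear in the output.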
Thanks for using Galaxy,
Jen
Galaxy team
--
Jennifer Jackson
http://usegalaxy.org
http://galaxyproject.org/wiki/Support
Hi, I have some refseq gene ids, like NM_***** and NR_******. I know how to convert NM_****** into the protein id NP_*****, but I do not know how to convert NR_***** into a protein id like NP_****. Could you please tell me? Thank you very much! Victor
Written 7.1 years ago by Li, Jilong (MU-Student)
Hello Victor, RefSeq sequences designated by a transcript identifier formatted as NR_* are non-coding (meaning: transcribed, but not translated), therefore there is no protein product and no linked protein sequence NP_* identifier. This documentation from NCBI covers RefSeq naming conventions: http://www.ncbi.nlm.nih.gov/RefSeq/key.html Hopefully this is helpful, Best, Jen Galaxy team -- Jennifer Jackson http://usegalaxy.org http://galaxyproject.org/wiki/Support
Written 7.1 years ago by Jennifer Hillman Jackson
Hi Victor,

It is not really a Galaxy-related answer, but you might wanna study the following webpage explaining the RefSeq accession format: http://www.ncbi.nlm.nih.gov/RefSeq/key.html#accession

Strictly speaking, there is no such thing as a "refseq gene id", since RefSeq entries describe individual molecules. There is a new subset of RefSeq called "RefSeqGene" (see http://www.ncbi.nlm.nih.gov/refseq/rsg/), but I don't think this is what you are after.

Hence, you can crosslink 'mRNA' (i.e. NM_*****) to proteins (i.e. NP_*****), and Jen gave you an excellent recipe for how to do that in Galaxy. However, you cannot crosslink 'RNA' (i.e. NR_*****, which are "non-coding transcripts including structural RNAs, transcribed pseudogenes, and others.") to proteins!

I hope this clarifies the confusion.

Regards, Hans
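To make the distinction concrete, here is a minimal sketch (not a Galaxy tool) that sorts a list of accessions by prefix before any mapping is attempted. The function name and the placeholder accessions are made up for illustration, and the regular expression is only an assumed simplification of the accession format; the three prefix meanings are the ones described on the RefSeq accession page linked above.

```python
import re

# Prefix meanings for the three accession types discussed in this thread.
PREFIX_MEANING = {
    "NM_": "mRNA",
    "NR_": "non-coding RNA",
    "NP_": "protein",
}

# Assumed simplified shape: two letters, underscore, digits, optional ".version".
ACCESSION_RE = re.compile(r"^([A-Z]{2}_)\d+(?:\.\d+)?$")


def classify_accession(accession):
    """Return a molecule type label for a RefSeq-style accession string."""
    match = ACCESSION_RE.match(accession.strip())
    if match is None:
        return "unrecognized"
    # Prefixes not listed above are reported as "other" rather than guessed.
    return PREFIX_MEANING.get(match.group(1), "other")


ids = ["NM_000001", "NR_000002", "NP_000001"]  # placeholders, not real records
for accession in ids:
    print(accession, classify_accession(accession))

# Only the mRNA accessions are candidates for an NM_ -> NP_ lookup.
mappable = [a for a in ids if classify_accession(a) == "mRNA"]
```

Running the NR_ entries through a protein lookup would return nothing, so filtering them out up front mostly saves confusion about "missing" results.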
Written 7.1 years ago by Hotz, Hans-Rudolf