If the reference genome is at UCSC and has a RefSeq track, then you
can extract a table called "refLink" from the Table Browser, which
holds the transcript and protein identifiers, and subset it to the
rows that match your query RefSeq transcript identifiers.
If the RefSeq data is at BioMart or another source, a path similar to
the one I outline below will work with some modifications. It all
depends on the file format, but Galaxy's tools can manipulate data in
just about every way you will need.
Using a transcript identifier query, subset protein identifiers in a
UCSC RefSeq track:
A. Load your list of NM* identifiers ("Get Data -> Upload").
- Set the file format to "tabular" (use the "pencil" icon to "Edit
Attributes -> Change data type") if needed.
B. Load the RefSeq id mapping data with "Get Data -> UCSC Main" and set
the form parameters as needed, choosing the track "RefSeq Genes" and the
table "refLink". Make sure the region is the entire genome. Send to
Galaxy formatted as-is (tabular).
Next, cut columns 3 and 4 (transcript id and protein id) out of the
table with the tool "Text Manipulation -> Cut" and the options "c3,c4".
C. OPTIONAL, if you want the full list of coding RefSeqs for another
purpose: remove the non-coding RefSeqs with the tool "Filter and Sort
-> Select" and the options "that: NOT Matching" and "the pattern:
^NR_.*$". Be sure to enter the regular expression '^NR_.*$' without
the quotes.
D. Perform a join using "Join, Subtract and Group -> Compare two
Datasets" with the options:
- "Compare: <file of trans and prot id, filtered or not>"
- "Using column: c1" where c1 is the trans ids
- "against: <file of trans ids>"
- "and column: c1" where c1 is the trans ids
- "To find: Matching rows of first dataset"
Result dataset is a two column tabular file:
transcript id <tab> protein id
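
If you want to check the logic outside of Galaxy, here is a minimal
Python sketch of the same cut, filter, and compare steps. It is not how
the Galaxy tools are implemented, just an approximation of what they do
to the data. The sample rows, the identifiers, and the function names
are made up for illustration; the only thing taken from the steps above
is that columns 3 and 4 hold the transcript and protein ids.

```python
import re

# Made-up rows standing in for a refLink download (tab-separated, no header).
# Only columns 3 and 4 (transcript id, protein id) matter here.
REFLINK_TEXT = (
    "geneA\tproduct A\tNM_900001\tNP_900001\t1\t1\t101\t0\n"
    "geneB\tproduct B\tNM_900002\tNP_900002\t2\t2\t102\t0\n"
    "geneC\tproduct C\tNR_900003\t\t3\t3\t103\t0\n"
)
# Made-up query list: one transcript id per line.
QUERY_TEXT = "NM_900002\nNM_900009\n"

NONCODING = re.compile(r"^NR_.*$")  # same pattern as in the optional step C


def read_tabular(text):
    """Split tab-separated text into a list of rows, skipping empty lines."""
    return [line.split("\t") for line in text.splitlines() if line]


def cut_columns(rows, columns=(2, 3)):
    """Keep the given 0-based columns; (2, 3) corresponds to "c3,c4"."""
    return [[row[i] for i in columns] for row in rows]


def drop_noncoding(pairs):
    """Remove rows whose transcript id matches the NR_ pattern."""
    return [pair for pair in pairs if not NONCODING.match(pair[0])]


def matching_rows(pairs, query_ids):
    """Keep rows of the first dataset whose column 1 appears in the query."""
    wanted = set(query_ids)
    return [pair for pair in pairs if pair[0] in wanted]


pairs = cut_columns(read_tabular(REFLINK_TEXT))           # cut c3,c4
coding = drop_noncoding(pairs)                            # optional NR_ filter
query_ids = [row[0] for row in read_tabular(QUERY_TEXT)]  # your NM* list
result = matching_rows(coding, query_ids)                 # compare on column 1

for transcript_id, protein_id in result:
    print(f"{transcript_id}\t{protein_id}")
```

Query ids with no match in the mapping (NM_900009 in this made-up
input) simply drop out of the result, which is also what "Matching rows
of first dataset" gives you.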
Hopefully this helps you and others who are doing a similar task. If
you think you will be doing this a lot, be sure to consider extracting
the steps into a workflow.
Thanks for using Galaxy,