Question: Retaining Or Reclaiming Bed Ids
Hi!

Having provided a name (field 4) in a UCSC BED file ( http://www.genome.ucsc.edu/FAQ/FAQformat.html#format1 ) and sought a RefSeq name using the UCSC Table Browser ( http://www.genome.ucsc.edu/cgi-bin/hgTables ), I would now like to recover which line of the BED file delivered which line of the output file... However, I am told I need Galaxy to provide a workflow to do this. Can anyone explain how?

E.g., one line of my BED file looks like:

chr2 2723752 2723777 seqid6354405 0 -

and one line of my intersected Table Browser output looks like:

chr1 176432306 176811970 NM_020318 0 + 176525458 176811590 0 23 248,1835,1072,146,294,193,122,490,129,92,194,147,136,217,172,178,214,169,136,110,72,99,455, 0,92236,131353,207799,226966,228955,232567,235929,239436,243188,246812,248664,276455,276809,302495,306436,307796,326638,328176,330389,336890,377002,379209,

Clearly the first line of my BED file doesn't correspond to the first line of my intersection output, but as my BED file is long, what reference can I use to unambiguously identify which line of my BED file each line of the intersection output corresponds to? How do I do this in Galaxy?

PS - I tried this workflow earlier today without success, aiming to achieve a similar objective: https://usegalaxy.org/u/james/w/workflow-from-ucsc-genes-and-symbols

PPS - I also note similar issues were raised in this discussion, with Galaxy promoted as the solution, but with no real details about how to achieve the desired results: http://redmine.soe.ucsc.edu/forum/index.php?t=msg&goto=10615&S=0d1b303e6dfdceaf3b240804fd0f52aa

Bert Gold, Ph.D., FACMG
Staff Scientist
NCI-Frederick
Frederick, MD 21702
VOICE: 301-846-5098
EMAIL: golda@mail.nih.gov
galaxy • 898 views
modified 4.9 years ago by Jennifer Hillman Jackson • written 4.9 years ago by Gold, Bert (NIH/NCI) [E]
Jennifer Hillman Jackson wrote:
Hello Bert,

The problem you had with the other workflow most likely had to do with using BED12 format instead of BED3 or BED6. BED12 represents one or more regions, whereas BED3-6 represents a single region. This is an important distinction for how the "Operate on Genomic Intervals" tools function - a single region is required.

There are two basic ways to do this, and both require that the query is re-run so that the RefSeq output is in BED6 - that is, a single region (interval) per line. The second is more direct, as the filtering is done only once, in Galaxy, but I will break down both and you can choose. Other manipulations to isolate the name or do counts once the data is all in one file will be similar to the workflow from James.

You do not need to run these steps as a workflow the first time while sorting out the parameters. Instead run the steps, evaluate and tune as needed, then create a workflow from the history for future queries.

BED and Interval format are very similar; a description of Interval is here: https://wiki.galaxyproject.org/Learn/Datatypes#Interval

Method 1:

1. Re-run the query as you have already performed it, but instead of selecting "Whole gene" as output, select "Exons". This will result in one line of output for each match, and possibly multiple lines per RefSeq if more than one input query coordinate region overlaps it. The "name" field will be annotated with the RefSeq identifier and the exon name (which you can break up/simplify later using tools from the group "Text Manipulation").
2. With both datasets in Galaxy (the query BED file and the output from #1), double check that the metadata assignments (chrom, start, end, name, strand) are correct by clicking on the pencil icon. https://wiki.galaxyproject.org/Learn/Managing%20Datasets#Dataset_Icons_.26_Text
3. Now run the "Operate on Genomic Intervals -> Join" tool - most likely with "overlap=1" and "inner join" settings, but review the options and decide.
4. This places all of your data in a single file, both intervals side-by-side. From here you can cut out columns, do counts (tool "Group"), etc.

Method 2:

1. Instead of running the initial query at UCSC with the first BED file as a filter for the RefSeq dataset, run the query without a filter and just extract all RefSeq exons into Galaxy.
2. Make sure both datasets are loaded and double check the metadata.
3. Run the "Join" tool again to merge the two datasets based on coordinate overlap as above.
4. Rearrange/continue as wanted. This includes isolating the RefSeq name and merging it back with any other dataset that includes that same RefSeq name, with the other "Join" tool in the group "Join, Subtract and Group".

When running the query, be sure to use the correct "Join" tool at each step. One matches on common keys (a "name") and one on overlapping coordinates. Be sure to use the one in the group "Operate on Genomic Intervals" for the first part of your query.

We have a couple of tutorials that demonstrate how these tools can be used, along with how to extract a workflow.

Galaxy101: https://usegalaxy.org/u/aun1/p/galaxy101
UsingGalaxy, Protocol1: https://usegalaxy.org/u/galaxyproject/p/using-galaxy-2012

Hopefully this helps,

Jen
Galaxy team

--
Jennifer Hillman-Jackson
http://galaxyproject.org
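For readers who want to see what the coordinate-overlap join in the answer does conceptually, here is a minimal Python sketch. It is not Galaxy's or UCSC's implementation: the function names `bed12_to_exons` and `overlap_join`, the `_exon_N` labels, and the `seqid_demo` query are all made up for illustration. It assumes the standard BED12 layout, where fields 11 and 12 hold comma-separated block sizes and block starts relative to chromStart. The first function splits a BED12 line into one interval per exon; the second pairs each query interval with every exon it overlaps by at least one base, keeping the query name next to the RefSeq name.

```python
from collections import defaultdict


def bed12_to_exons(line):
    """Split one BED12 line into per-exon (chrom, start, end, name, strand) tuples."""
    f = line.split()
    chrom, start, name, strand = f[0], int(f[1]), f[3], f[5]
    sizes = [int(x) for x in f[10].split(",") if x]    # blockSizes (trailing comma allowed)
    offsets = [int(x) for x in f[11].split(",") if x]  # blockStarts, relative to chromStart
    return [
        # The "_exon_N" label is a hypothetical naming scheme for this sketch.
        (chrom, start + off, start + off + size, f"{name}_exon_{i}", strand)
        for i, (size, off) in enumerate(zip(sizes, offsets))
    ]


def overlap_join(queries, targets, min_overlap=1):
    """Inner join: pair each query with every target sharing >= min_overlap bases."""
    by_chrom = defaultdict(list)
    for t in targets:
        by_chrom[t[0]].append(t)
    joined = []
    for q in queries:
        for t in by_chrom[q[0]]:
            # BED intervals are half-open, so this is the number of shared bases.
            shared = min(q[2], t[2]) - max(q[1], t[1])
            if shared >= min_overlap:
                joined.append(q + t)  # both intervals side by side
    return joined


# The BED12 line quoted in the question.
bed12_line = (
    "chr1 176432306 176811970 NM_020318 0 + 176525458 176811590 0 23 "
    "248,1835,1072,146,294,193,122,490,129,92,194,147,136,217,172,178,"
    "214,169,136,110,72,99,455, "
    "0,92236,131353,207799,226966,228955,232567,235929,239436,243188,"
    "246812,248664,276455,276809,302495,306436,307796,326638,328176,"
    "330389,336890,377002,379209,"
)

queries = [
    ("chr2", 2723752, 2723777, "seqid6354405", "-"),   # query line from the question
    ("chr1", 176432400, 176432425, "seqid_demo", "+"),  # made-up query inside the first exon
]

exons = bed12_to_exons(bed12_line)
for row in overlap_join(queries, exons):
    print("\t".join(str(x) for x in row))
```

Because each joined row carries the query's own name (field 4) alongside the RefSeq name, the original BED line can be identified directly, which is the point of using single-interval input on both sides of the join. The nested loop is fine for a sketch; large inputs would call for sorted intervals or an interval tree.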
written 4.9 years ago by Jennifer Hillman Jackson