How to find genetic distance for a set of genomic intervals

3.6 years ago by

United States

I have a set of genomic regions (in BED format) and I want to find the genetic distance, in cM, from each region to its nearest gene.

I know that this requires a human genetic map for the recombination rates. I am using the hg18 reference genome, so I got the recombination map from here: http://www.well.ox.ac.uk/%7Eanjali/AAmap/

The format of the recombination map is something like this:

Physical_Position_Build36(hg18) Genetic_Map(cM)
742584 0
744045 8.96305756252859e-09
750775 4.12689390157335e-08
758311 8.14337595996149e-08
766409 1.32596464295120e-07
769185 1.46222766745286e-07

But my genomic intervals are like these:

chr1   751448   752765   NR_024321
chr1   752833   784689   NR_047519
chr1   752833   768847   NR_047526
chr1   752833   784689   NR_047524
chr1   752833   784689   NR_047523

Should I just find the position of the reference map for each genomic interval and assign the cM value for that interval? This will probably just give the distance of each region in cM, but I want their distance from the nearest gene in cM.

Or is there some other way? Is there any tool to do that?

Please help!

galaxy samtools • 1.5k views

modified 3.6 years ago by Jennifer Hillman Jackson ♦ 25k • written 3.6 years ago by pooja.narang • 0

3.6 years ago by

Jennifer Hillman Jackson ♦ 25k

United States

Jennifer Hillman Jackson ♦ 25k wrote:

Hello,

Using the genomic intervals to find the closest gene is fairly straightforward (see the tool "Operate on Genomic Intervals -> Fetch closest non-overlapping feature"). All you need to do is choose the reference annotation you wish to compare against. UCSC is a good source: use the Table Browser and send the result to Galaxy in BED format. These would be transcripts, but there are ancillary files for each primary track that can consolidate/map transcripts to genes (various "gene" types: Gene Symbols, HUGO, Ensembl, and more). Extract these with the tool "Get Data -> UCSC Main".
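For anyone who wants to check the logic outside Galaxy, here is a minimal Python sketch of what a "closest non-overlapping feature" lookup does. It is not the Galaxy tool's implementation; the function names (`gap`, `closest_non_overlapping`) and the extra test genes are hypothetical, and it assumes BED-style half-open coordinates.

```python
from typing import List, Optional, Tuple

# (chrom, start, end, name), BED-style half-open coordinates (assumption)
Interval = Tuple[str, int, int, str]

def gap(a: Interval, b: Interval) -> Optional[int]:
    """Physical gap in bp between two intervals.

    Returns None if they are on different chromosomes or overlap.
    """
    if a[0] != b[0]:
        return None
    if a[2] <= b[1]:   # a ends before b starts
        return b[1] - a[2]
    if b[2] <= a[1]:   # b ends before a starts
        return a[1] - b[2]
    return None        # the two intervals overlap

def closest_non_overlapping(query: Interval,
                            genes: List[Interval]) -> Optional[Interval]:
    """Return the gene with the smallest gap to the query, or None."""
    best, best_gap = None, None
    for g in genes:
        d = gap(query, g)
        if d is not None and (best_gap is None or d < best_gap):
            best, best_gap = g, d
    return best

# First interval from the question, used here as the query.
query = ("chr1", 751448, 752765, "NR_024321")
genes = [
    ("chr1", 752833, 784689, "NR_047519"),      # from the question
    ("chr1", 700000, 710000, "hypothetical_A"),  # made-up, farther away
    ("chr1", 751000, 752000, "hypothetical_B"),  # made-up, overlaps query
]
nearest = closest_non_overlapping(query, genes)
```

A linear scan like this is fine for a few thousand features; for whole-genome annotation you would sort by position first, which is what the dedicated tools do.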

In order to do the next step, a protocol that goes something like this might work.

1. identify the closest transcript/gene for each interval
2. create an interval file that represents the region between the interval and nearest transcript/gene
3. add in the position (chrom/start/end) in interval/bed format for the genetic map info
4. use #2 to define the regions to extract from the data in #3

#4 would be best done using the same protocol as described for the sample "random distance" calculations at the same web site where you obtained the reference file above (example samples here). The calculations themselves can almost certainly be done within Galaxy using "Text manipulation" and similar tools. When you come up with a successful protocol, please consider sharing/publishing it on Galaxy Main for others to use and posting back the share link here.
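As a rough illustration of the calculation itself, the sketch below linearly interpolates the cumulative map value at the two ends of a gap region and takes the difference. This is one common way to read a cumulative genetic map, not necessarily the protocol used at the map's web site. The function names (`cm_at`, `genetic_distance`) are hypothetical, the map rows are the ones quoted in the question, and positions outside the map are simply clamped to the first or last value.

```python
from bisect import bisect_right
from typing import List

def cm_at(pos: int, positions: List[int], cms: List[float]) -> float:
    """Linearly interpolate the cumulative map value at a physical position.

    `positions` must be sorted ascending; positions outside the map are
    clamped to the first/last map value (an assumption of this sketch).
    """
    if pos <= positions[0]:
        return cms[0]
    if pos >= positions[-1]:
        return cms[-1]
    i = bisect_right(positions, pos)   # positions[i-1] <= pos < positions[i]
    x0, x1 = positions[i - 1], positions[i]
    y0, y1 = cms[i - 1], cms[i]
    return y0 + (y1 - y0) * (pos - x0) / (x1 - x0)

def genetic_distance(a: int, b: int,
                     positions: List[int], cms: List[float]) -> float:
    """Genetic distance between two positions on the same chromosome."""
    return abs(cm_at(b, positions, cms) - cm_at(a, positions, cms))

# Map rows copied from the question (chr1, hg18).
positions = [742584, 744045, 750775, 758311, 766409, 769185]
cms = [0.0, 8.96305756252859e-09, 4.12689390157335e-08,
       8.14337595996149e-08, 1.32596464295120e-07, 1.46222766745286e-07]

# Gap between the end of chr1:751448-752765 and a gene starting at 752833.
d = genetic_distance(752765, 752833, positions, cms)
```

The map must be filtered to a single chromosome before use, since the physical positions restart on each chromosome.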

I am not aware of any specialized tools for this type of manipulation in the Tool Shed, but you could review it. Tool Shed tools are for use in a local/cloud Galaxy.

There could also very well be specialized tools for this type of manipulation at a public Galaxy instance. Each has its own focus, and the tools change over time. Reviewing them is the best way to see if any are a fit (but no guarantees!): Galaxy Public Servers

Best, Jen, Galaxy team
