Mappability

Question: Mappability

I want to intersect a few hundred genomic intervals (predicted translocation sites from SVDetect) with low-mappability regions of the genome (we are working with mm9 right now). UCSC has an excellent mappability track that exactly matches our sequencing data (50 bp k-mers), but it seems very difficult to get that data into Galaxy. I want a BED file that summarizes intervals of low mappability (i.e. less than 0.5 on the scale used by UCSC).

The UCSC Table Browser has a limit of 10M lines, which seems to give just part of chromosome 1. It would be very messy to fetch the whole genome bit by bit this way and then concatenate the pieces.

UCSC Help suggests downloading the mappability data for the whole genome as a bigWig file and then converting it to BED. I gave this a try, but we get a 4 GB file with intervals of just one or two base pairs. Again, it is a lot of work to get back to the nicer BED that I could make with the UCSC tools over smaller genomic regions. It is also super painful to upload this huge file to Galaxy, and I am unhappy about writing my own parsers to filter and smooth it.

Any other suggestions? Maybe someone else knows where to find a mappability file (for mm9) that has nice intervals in a Galaxy-compatible format.

Stuart Brown

galaxy • 1.0k views

modified 6.6 years ago by Jennifer Hillman Jackson ♦ 25k • written 6.6 years ago by Brown, Stuart • 30

Jennifer Hillman Jackson ♦ 25k wrote:

Hi Stuart,

If you are able to rsync the Mappability bigWig file from the UCSC downloads server and convert it to BED using their compiled tools (also available on the same server), then the rest should be fairly straightforward.

1 - Load the data into Galaxy using FTP: http://wiki.g2.bx.psu.edu/FTPUpload

2 - Merge the fragmented intervals into ranges that better suit your needs with the Galaxy tools in the group "Operate on Genomic Intervals"; in particular, see the "Merge" and "Cluster" tools.

This data is large, but the only way to determine whether it is too large to run on the public main instance is to try. If you end up with a memory error, then moving to a local or cloud instance would be the recommendation. Full instructions are here: http://usegalaxy.org

Hopefully this simplifies the process for you!

Best,
Jen
Galaxy team

--
Jennifer Jackson
http://galaxyproject.org
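If the converted file is too large to upload comfortably, the filtering and merging could also be done locally before step 1. The following is only a minimal sketch, not a Galaxy or UCSC tool. It assumes the bigWig was converted to a four-column, coordinate-sorted bedGraph (chrom, start, end, score), for example with UCSC's bigWigToBedGraph utility. The function name `low_mappability_intervals` and the `max_gap` parameter are hypothetical; the 0.5 threshold comes from the question.

```python
from typing import Iterable, Iterator, Tuple

Interval = Tuple[str, int, int]


def low_mappability_intervals(
    lines: Iterable[str],
    threshold: float = 0.5,
    max_gap: int = 0,
) -> Iterator[Interval]:
    """Yield merged (chrom, start, end) intervals whose score is below threshold.

    Assumes bedGraph-style input sorted by chromosome and start position.
    Low-scoring intervals on the same chromosome are merged when the gap
    between them is at most max_gap bases (0 merges only touching intervals).
    """
    current = None  # interval being extended, or None
    for line in lines:
        line = line.strip()
        # Skip blank lines and optional track/browser/comment headers.
        if not line or line.startswith(("track", "browser", "#")):
            continue
        chrom, start_s, end_s, score_s = line.split()[:4]
        if float(score_s) >= threshold:
            continue  # mappable enough, not of interest
        start, end = int(start_s), int(end_s)
        if current and current[0] == chrom and start - current[2] <= max_gap:
            # Extend the open interval instead of starting a new one.
            current = (chrom, current[1], max(current[2], end))
        else:
            if current:
                yield current
            current = (chrom, start, end)
    if current:
        yield current


def write_bed(intervals: Iterable[Interval], out) -> None:
    """Write intervals as three-column BED lines to an open text file."""
    for chrom, start, end in intervals:
        out.write(f"{chrom}\t{start}\t{end}\n")
```

Because the function streams line by line, it can be fed an open file handle for the 4 GB bedGraph without loading it into memory, and `write_bed` then produces a much smaller BED file to upload. Raising `max_gap` smooths over short mappable stretches between low-mappability intervals; a suitable value depends on the analysis.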

written 6.6 years ago by Jennifer Hillman Jackson ♦ 25k
