Hi, I am currently using GRCh38/hg38 as my reference. I would like to ask whether a 100-way multiZ alignment (with hg38 as the reference) is available in Galaxy. If not, what other options are available?
Hello,
The current MAF/multiz alignment available in Galaxy is based on hg19. Older builds exist, but this is the latest one for human.
Updating to hg38 is not an immediate goal, but I will add it to the request list (scroll down to see the post: https://github.com/galaxyproject/galaxy/issues/1470).
Thanks, Jen, Galaxy team
Hello, I have been using hg38 as my reference. What would you advise so that I can follow progress on this? Thanks.
The link I shared has a list of to-do items for data additions, and the MAF for hg38 has now been added to that list. You can follow progress there.
I should let you know that there is one other option: using MAF data from the history with the tool. This involves obtaining the hg38 MAF data from the UCSC Downloads area (found under Human, hg38, "Conservation") and loading the files into Galaxy. I do not know how large the data is uncompressed, so it may or may not fit into the 250 GB account quota at http://usegalaxy.org unless you clear out (permanently delete) other work. But it is a choice you could explore. I tested this functionality yesterday (for a different purpose) using MAF data from a single chromosome, and the MAF tools functioned without issue.
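If you download a few of the files locally first, a rough size check along these lines could tell you whether the uncompressed data is likely to fit. This is only a sketch: the function names are made up for illustration, and the quota is assumed to mean 250 × 10^9 bytes.

```python
import gzip


def uncompressed_size(path, chunk_size=1 << 20):
    """Count the bytes a .gz file expands to, streaming so nothing is kept in memory."""
    total = 0
    with gzip.open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total


def fits_quota(paths, quota_bytes=250 * 10**9):
    """True if the combined uncompressed size of the files is within the quota.

    quota_bytes is an assumption (250 GB read as decimal gigabytes).
    """
    return sum(uncompressed_size(p) for p in paths) <= quota_bytes
```

Keep in mind that while the per-chromosome datasets and a merged copy both sit in the history, the space used will likely be about double the figure this reports.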
Do not attempt to extract this data from the UCSC Table Browser as the data is too large and will be truncated for most chromosomes. Locate the data in the UCSC Downloads area and load by URL or download locally then load using FTP.
This is exactly where to get it. The files you want are those named like chrNNN.maf.gz. Once they are in your history, use the Concatenate datasets tool to create a single reference MAF dataset. After you have confirmed that the merge was successful, permanently delete the per-chromosome datasets to recover space.
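To load by URL, you could generate the list of links and paste it into the upload form. A minimal sketch follows; the base URL is my assumption about where the hg38 100-way files sit, so confirm it against the UCSC Downloads page before using it. The function name is hypothetical.

```python
# ASSUMPTION: location of the hg38 100-way MAF files. Verify in the UCSC Downloads area.
BASE_URL = "https://hgdownload.soe.ucsc.edu/goldenPath/hg38/multiz100way/maf/"


def primary_maf_urls(base_url=BASE_URL):
    """Build one URL per primary chromosome, following the chrNNN.maf.gz naming."""
    names = [str(n) for n in range(1, 23)] + ["X", "Y"]
    return [f"{base_url}chr{name}.maf.gz" for name in names]


if __name__ == "__main__":
    # One URL per line, ready to paste into an upload-by-URL box.
    print("\n".join(primary_maf_urls()))
```

Only chr1 to chr22, chrX and chrY are listed here; add any other names you need to the list.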
This could all be done on a local/cloud Galaxy as well, given sufficient resources.
Best, Jen, Galaxy team
Thank you for your help and explaining in detail.
From the UCSC download link, I am trying to download the multiz100way alignments. In the "maf" folder, I can see several maf.gz files for all the chromosomes. Some of them are named like chr22.maf.gz, and some are named like chr22_GL383583v2_alt.maf.gz and chr22_KI270731v1_random.maf.gz. Please see the screenshot of the files listed in the folder (attached below). There is nothing in the README to explain what is in these files. Do I need just the chromosome files (chr22.maf.gz) or all the files (every file related to the chromosome)?
Your help is much appreciated.
Just use the primary MAFs unless your query BED dataset contains these additional chromosome variants. The data will only link if the chromosome name in your query is an exact match for the chromosome name in the MAF. In other words, you could load all of the files, but only those with a matching chromosome name will be part of the analysis.
Example: for data coordinates based on chr22, use the chr22.maf.gz file.
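If you are unsure which files your query needs, a quick check like the one below can list them. It is a sketch only: the function names are invented for illustration, and it assumes each MAF file is named after its chromosome plus ".maf.gz", as in the folder listing.

```python
def chromosomes_in_bed(bed_lines):
    """Collect the distinct chromosome names (first column) from BED lines."""
    chroms = set()
    for line in bed_lines:
        line = line.strip()
        # Skip blanks, comments, and track/browser header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chroms.add(line.split()[0])
    return chroms


def maf_files_needed(bed_lines, available_files):
    """Split the BED chromosomes into MAF files that exist and names with no exact match."""
    available = set(available_files)
    needed, missing = [], []
    for chrom in sorted(chromosomes_in_bed(bed_lines)):
        name = f"{chrom}.maf.gz"  # assumed naming: <chromosome>.maf.gz
        (needed if name in available else missing).append(name)
    return needed, missing


if __name__ == "__main__":
    bed = ["chr22\t100\t200", "22\t1\t10"]
    files = ["chr22.maf.gz", "chr22_GL383583v2_alt.maf.gz"]
    print(maf_files_needed(bed, files))
```

Anything reported as missing, such as a "22" that lacks the "chr" prefix, would not link to the MAF data because the match is exact.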