Building a human reference genome

Dear Biostars,

I would like to build an index for the human genome to use as a reference genome in a local instance of Galaxy. The starting point is to obtain the genome files from UCSC by FTP from 'hgdownload.cse.ucsc.edu'.

If I navigate to ./hg38/bigZips, I find many different files, and I am not sure which one to download.

I was thinking of downloading the individual chromosome files and then combining them into one .fa file, as I did months ago with the mouse genome (previous post).

But I would like to know what you think about this and whether you have a better approach.


modified 13 months ago by Jennifer Hillman Jackson ♦ 25k • written 13 months ago by mohammedtleis • 0

Jennifer Hillman Jackson ♦ 25k wrote:

Hello,

The combined fasta files are hg38.2bit (soft masked, which can be converted to fa using UCSC's utilities), hg38.fa.gz (soft masked), and hg38.masked.fa.gz (hard masked).
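To make the soft vs. hard masking distinction concrete: in a soft-masked FASTA the repeat regions are written in lowercase, while in a hard-masked FASTA those same bases are replaced with N. The sketch below derives a hard-masked sequence from a soft-masked one. The function name `hard_mask` and the toy record are made up for illustration, and this is not necessarily how UCSC produces hg38.masked.fa.gz.

```python
import re

# Any lowercase letter marks a soft-masked (repeat) base.
_LOWERCASE = re.compile(r"[a-z]")


def hard_mask(lines):
    """Yield FASTA lines with soft-masked (lowercase) bases replaced by N.

    Header lines (starting with '>') are passed through unchanged.
    """
    for line in lines:
        if line.startswith(">"):
            yield line
        else:
            yield _LOWERCASE.sub("N", line)


# Toy record (hypothetical data, not real hg38 sequence).
soft = [">chrDemo\n", "ACGTacgtNNacGT\n"]
hard = list(hard_mask(soft))
print("".join(hard), end="")
```

Since the soft-masked file keeps the full sequence, it is usually the more flexible starting point: a hard-masked version can be derived from it, but not the other way around.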

But you can also do the combining yourself. One reason to do that is if you want to create a special variant of the genome, or if you simply prefer to work on the command line and have complete control over the content (soft vs. hard masked, whether to include haplotypes and unmapped sequences, etc.).
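As a rough illustration of controlling the content yourself, the sketch below filters a combined FASTA down to the primary chromosomes and drops everything else. It assumes UCSC hg38 naming, where primary sequences are named chr1 to chr22, chrX, chrY, and chrM, and the extra sequences carry longer names (for example with `_alt` or `_random` suffixes or a `chrUn_` prefix). The helper names `is_primary` and `filter_fasta` and the toy records are made up for this example.

```python
import re

# Assumption: primary hg38 sequences are named chr1..chr22, chrX, chrY, chrM.
_PRIMARY = re.compile(r"chr(\d+|X|Y|M)")


def is_primary(name):
    """Return True if a sequence name looks like a primary chromosome."""
    return _PRIMARY.fullmatch(name) is not None


def filter_fasta(lines, keep=is_primary):
    """Yield only the FASTA records whose sequence name passes `keep`."""
    writing = False
    for line in lines:
        if line.startswith(">"):
            # The sequence name is the first word after '>'.
            parts = line[1:].split()
            writing = keep(parts[0]) if parts else False
        if writing:
            yield line


# Toy records (hypothetical sequence data).
records = [
    ">chr1\n", "ACGT\n",
    ">chr1_KI270706v1_random\n", "GGCC\n",
    ">chrUn_KI270302v1\n", "TTAA\n",
    ">chrM\n", "GATC\n",
]
kept = list(filter_fasta(records))
print("".join(kept), end="")
```

For a real file, the same generator could be fed from `gzip.open("hg38.fa.gz", "rt")` and the kept lines written to a new .fa file. Because it streams line by line, the whole genome never has to be held in memory.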

Another choice is to use Data Managers for fetching and indexing. When UCSC is set as the source during a fetch, the default options create a soft-masked build that includes everything.

Hope that helps! Jen, Galaxy team

modified 13 months ago • written 13 months ago by Jennifer Hillman Jackson ♦ 25k

Thanks Jennifer!

I downloaded hg38.fa.gz, but I couldn't upload it to Galaxy since it is larger than 2 GB, and Galaxy asked me to upload it via FTP. So far, I found this link: Install some ftp server. I installed and configured proftpd, but I am not sure how to connect it to Galaxy's database. Is there any information on this?

I am also curious which UCSC tools can be used to index the genome.

Best Regards, M. Tleis

modified 13 months ago • written 13 months ago by mohammedtleis • 0

You would only need to upload the fasta to Galaxy if you intend to customize the index (not use one of the originals) or use it as a Custom reference genome (not recommended for a genome of this size). For the Data Manager path, this would involve uploading the fasta to a Galaxy history and using that fasta as the input to a Data Manager that indexes the base genome. Run the "Fetch genome" DMs first, then the others (samtools, picard, 2bit, and the other per-tool indexes).

If you plan to use just the base genome, use the Data Managers directly. They will fetch the genome without you needing to do anything special (no uploading files, etc.). The process is about the same: use a fetch genome Data Manager sourcing from UCSC, instead of an uploaded fasta, then run the other DMs (in order).

UCSC tools are not needed to create indexes. They do many things, but in this context they would only be used to convert formats (twoBitToFa, or the reverse, faToTwoBit). That conversion is not needed in your case, since the fasta is already available and there is a DM that builds a 2bit index from a fasta already loaded with a DM.
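For completeness, if the 2bit-to-fasta conversion were ever needed outside Galaxy, it could be scripted roughly as below. This is only a sketch: it assumes the UCSC command-line utility is installed under the name `twoBitToFa` and takes the input .2bit and output .fa paths as positional arguments. The wrapper function names and the file names are made up for the example.

```python
import shutil
import subprocess


def two_bit_to_fa_command(two_bit_path, fasta_path):
    """Build the argument list for the UCSC twoBitToFa utility."""
    return ["twoBitToFa", two_bit_path, fasta_path]


def convert_two_bit(two_bit_path, fasta_path):
    """Run twoBitToFa, failing early if the utility is not installed."""
    if shutil.which("twoBitToFa") is None:
        raise FileNotFoundError("twoBitToFa was not found on PATH")
    # check=True raises CalledProcessError if the conversion fails.
    subprocess.run(two_bit_to_fa_command(two_bit_path, fasta_path), check=True)


# Only build and show the command here; nothing is executed.
command = two_bit_to_fa_command("hg38.2bit", "hg38.fa")
print(" ".join(command))
```

Calling `convert_two_bit("hg38.2bit", "hg38.fa")` would perform the actual conversion on a machine where the utility is available.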

The final option is to create all indexes manually. This is really not recommended unless you are experienced with it and are willing to troubleshoot. Data Managers should be used if at all possible, as things will go much more smoothly. But I'll link the help pages for manually creating indexes and other related tasks below. Just be aware that these docs are a bit older and, as I said, might require you to do some troubleshooting. We no longer provide detailed step-by-step documents for manual index installation. The DMs have replaced that need, as even customized genomes can be used with them (if you load the target fasta by FTP into a history, or from the file system into a Data Library and then into a working history; link also included):

There are a few other ways to create indexes in a local instance if you want to try them (good for adding a batch of genomes at once). These require more command-line work, whereas DMs are all GUI based. Specifically, the genomes indexed at https://usegalaxy.org are available from the rsync server (covered in the link above), and there is a method to create indexes in a local instance using the data organizer plugin Ephemeris (https://github.com/galaxyproject/ephemeris).

I hope that between these options there is a choice that will work out for you! Jen, Galaxy team

modified 13 months ago • written 13 months ago by Jennifer Hillman Jackson ♦ 25k
