Question: Building a human reference genome
mohammedtleis wrote (13 months ago):

Dear Biostars,

I would like to build an index for the human genome to use as a reference genome in a local instance of Galaxy. The starting point is to obtain the genome files from UCSC over FTP from 'hgdownload.cse.ucsc.edu'.

When I navigate to ./hg38/bigZips, I find many different files, and I am not sure which one to download.

I was thinking of downloading the individual chromosome files and then combining them into one .fa file, as I did months ago with the mouse genome (see my previous post).

I would like to know what you think about this and whether you have a better approach.

Jennifer Hillman Jackson (United States) wrote (13 months ago):

Hello,

The combined fasta files are hg38.2bit (soft masked; it can be converted to fasta using UCSC's utilities), hg38.fa.gz (soft masked), and hg38.fa.masked.gz (hard masked).
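
To make the soft vs. hard masked distinction concrete, here is a minimal Python sketch. It assumes the usual convention that soft-masked repeats are written in lowercase and hard-masked repeats are replaced by N. The function names are made up for illustration and are not part of any UCSC or Galaxy tool.

```python
import re


def hard_mask(sequence: str) -> str:
    """Replace soft-masked (lowercase) bases with N; uppercase bases are kept."""
    return re.sub(r"[a-z]", "N", sequence)


def hard_mask_fasta_lines(lines):
    """Yield FASTA lines with sequence lines hard masked and header lines untouched."""
    for line in lines:
        if line.startswith(">"):
            # Header lines may legitimately contain lowercase text, so skip them.
            yield line
        else:
            yield hard_mask(line)
```

Going the other way is not possible: once bases are replaced by N, the original sequence is gone, so the soft-masked file is the more flexible starting point.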

But you can also do the combining yourself. One reason to do that is if you want to create a special variant of the genome, or if you simply prefer to work on the command line and have complete control over the content (soft vs. hard masked, haplotypes/unmapped sequences included or not, etc.).
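
If you do the combining yourself, it could look roughly like the Python sketch below. It assumes a directory of uncompressed per-chromosome files named like chr1.fa, chr1_KI270706v1_random.fa, or chrUn_KI270302v1.fa, and it treats the suffixes _alt, _fix, _random and the prefix chrUn as markers for non-primary sequences. Both the naming patterns and the function names are assumptions for illustration, so check them against the files you actually downloaded.

```python
import re
import shutil
from pathlib import Path


def natural_key(path):
    """Sort key that orders chr2 before chr10 by comparing digit runs as integers."""
    parts = re.split(r"(\d+)", Path(path).name)
    return [int(p) if p.isdigit() else p for p in parts]


def is_primary(path):
    """Heuristic (assumption): extra sequences are marked by these name patterns."""
    name = Path(path).stem
    return not (name.startswith("chrUn") or name.endswith(("_alt", "_fix", "_random")))


def combine_fasta(chrom_dir, out_path, primary_only=False):
    """Concatenate every *.fa file in chrom_dir into out_path; return the names used."""
    out_resolved = Path(out_path).resolve()
    files = sorted(Path(chrom_dir).glob("*.fa"), key=natural_key)
    # Never read a previous output file back in as an input.
    files = [f for f in files if f.resolve() != out_resolved]
    if primary_only:
        files = [f for f in files if is_primary(f)]
    with open(out_path, "wb") as out:
        for f in files:
            # Assumes each input file ends with a newline, as FASTA files normally do.
            with open(f, "rb") as src:
                shutil.copyfileobj(src, out)
    return [f.name for f in files]
```

Calling combine_fasta("chroms", "hg38_primary.fa", primary_only=True) would write only the files that pass the heuristic. The sort order is deterministic, but it is not guaranteed to match the order UCSC uses in its own combined file.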

Another choice is to use Data Managers for fetching and indexing. When UCSC is set as the source during a fetch, the default options create a soft-masked build that includes everything.
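
To see what "everything" means for a given fasta, you could count the sequences by type. The Python sketch below reads a plain or gzipped fasta and groups the header names using the same naming patterns as a heuristic (_alt, _fix, _random, chrUn). The patterns, category labels, and function names are assumptions for illustration, not something Galaxy or UCSC provides.

```python
import gzip
from collections import Counter


def open_fasta(path):
    """Open a FASTA file as text, transparently handling .gz files."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def classify(name):
    """Heuristic (assumption): infer the sequence type from its name."""
    if name.endswith("_alt"):
        return "alternate"
    if name.endswith("_fix"):
        return "fix"
    if name.endswith("_random"):
        return "unlocalized"
    if name.startswith("chrUn"):
        return "unplaced"
    return "primary"


def summarize_fasta(path):
    """Count the sequences in a FASTA file by inferred type."""
    counts = Counter()
    with open_fasta(path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                # The sequence name is the first whitespace-delimited token of the header.
                counts[classify(fields[0] if fields else "")] += 1
    return dict(counts)
```

A file that reports only "primary" entries contains no haplotypes or unmapped sequences; any other keys tell you which extras were included.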

Hope that helps! Jen, Galaxy team


Thanks Jennifer!

I downloaded hg38.fa.gz, but I couldn't upload it to Galaxy because it is larger than 2 GB, and Galaxy asked me to upload it via FTP instead. So far, I have found this link: Install some ftp server. I installed and configured proftpd, but I am not sure how to connect it to Galaxy's database. Is there any information on this?

I am also curious which UCSC tools can be used to index the genome.

Best Regards, M. Tleis


You would only need to upload the fasta to Galaxy if you intend to customize the index (rather than use one of the originals) or use it as a Custom reference genome (not recommended for a genome of this size). For the Data Manager path, this would involve uploading the fasta to a Galaxy history and using that fasta as the input to a Data Manager that indexes the base genome. Run the "Fetch genome" DMs first, then the others (samtools, picard, 2bit, and the other per-tool indexes).

If you plan to just use the base genome, use the Data Managers directly. They will fetch the genome without you needing to do anything special (no uploading files, etc.). The process is about the same: use a fetch genome Data Manager sourcing from UCSC, instead of an uploaded fasta, then run the other DMs (in order).

UCSC tools are not needed to create indexes. They do many things, but in this context they would only be used to convert formats (twoBitToFasta, or the reverse). Neither conversion is needed in your case: the fasta is already available, and there is a DM that converts a fasta already loaded with a DM into a 2bit index.

The final option is to create all indexes manually. This is really not recommended unless you are experienced with it and are willing to troubleshoot. Data Managers should be used if at all possible, as things will go much more smoothly. I'll link the help pages for manually creating indexes and other related tasks below; just be aware that these docs are a bit older and, as I said, might require you to do some troubleshooting. We no longer provide detailed step-by-step documents for installing indexes manually. The DMs have replaced that need, since even customized genomes can be used with them (if you first load the target fasta by FTP into a history, or from the file system into a Data Library and then into a working history; link also included):

There are a few other ways to create indexes in a local Galaxy if you want to try them (good for adding a batch of genomes at once). These require more command-line work, whereas DMs are all GUI based. Specifically, the genomes indexed at https://usegalaxy.org are available from the rsync server (covered in the link above), and there is a method to create indexes in a local Galaxy using the data organizer plugin Ephemeris (https://github.com/galaxyproject/ephemeris).

I hope that between these options there is a choice that will work out for you! Jen, Galaxy team
