Galaxy local installation - download reference genomes


I have installed a local instance of Galaxy and a couple of tools on an Ubuntu server. My intention is to get the GATK pipeline for variant calling implemented. To run bwa I need a reference genome, so I need to obtain one.

Personally I find the documentation for Galaxy very confusing and "all over the place". However, if I understood things correctly I needed to install the rsync data manager - which I did.

At this point, when I click on the "Local data" menu item which is under the "Data" heading in the menu panel on the left, I then have an item, "Reference Genome - fetching" under the "Run Data Manager Tools" section in the big panel to the right of the menu panel. Clicking on that I get a form to fill in. I selected hg19 for this and then clicked "Execute".

Looking in the "Manage Jobs" area, this job has now been running for more than 15 hours.

What is this doing? Is it downloading the reference genome? Where is it downloading to so that I can check the progress? Is it downloading indices too (because I could not find any explanation of how this is done in the docs and videos that I looked at)?

I appreciate any and all help.

Kind Regards, Jannetta


modified 2.6 years ago • written 2.7 years ago by jannetta.steyn • 10

Jennifer Hillman Jackson ♦ 25k (United States) wrote 2.7 years ago:

Hello,

If you want to use GATK, then hg_g1k_v37 is the correct reference genome (database) to select from the rsync server. More here (a supplemental resource to the wiki): Fasta Format, Custom Genomes, and GATK Chromosome ordering

Please be aware that this genome was obtained from the GATK v1 resource bundle and will work well with the GATK tools at revision 1.4, but not with later versions.

A better approach, if you are loading data to use with newer GATK tools (from the Tool Shed), is to obtain the needed data from the latest GATK resource bundle. A fasta file loaded into a history can be imported (fetch) and then indexed (samtools, picard, bwa, etc.). Fetch and index in this order. The newer GATK tools do not have a specific data manager associated with them, due to the way that GATK is now licenced.

The location of the data is noted in the galaxy.ini file (both default and custom locations that you may have specified).
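As a rough illustration of looking those locations up, the sketch below reads the data-related settings from a galaxy.ini-style file with Python's configparser. The `[app:main]` section and the option names `tool_data_path` and `galaxy_data_manager_data_path` are assumed from the sample config shipped with Galaxy releases of that period, and the `tool-data` fallback is likewise an assumption, so check them against your own file.

```python
import configparser

# Option names assumed from the sample galaxy.ini; verify against your own config.
DATA_OPTIONS = ("tool_data_path", "galaxy_data_manager_data_path")


def data_locations(ini_text, section="app:main", default="tool-data"):
    """Return {option: path} for the data-related options in a galaxy.ini-style text."""
    # interpolation=None so values such as %(here)s are returned verbatim
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    return {
        opt: parser.get(section, opt, fallback=default)
        for opt in DATA_OPTIONS
    }


# Hypothetical config contents, for illustration only.
example = """
[app:main]
# commented-out options fall back to the assumed default
#tool_data_path = tool-data
galaxy_data_manager_data_path = /srv/galaxy/reference-data
"""

for option, path in data_locations(example).items():
    print(option, "->", path)
```

Once the directory is known, watching its size grow (for example with `du -sh`) is one way to check whether a long-running fetch job is still making progress.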

More about data, including how to install/index without data managers, is covered in detail in the wiki. Start here: https://wiki.galaxyproject.org/Admin (section: Data Resources)

Thanks, Jen, Galaxy team

written 2.7 years ago by Jennifer Hillman Jackson ♦ 25k

jannetta.steyn • 10 wrote 2.7 years ago:

Hi Jen

Thank you very much for responding to my question. However, I'm still at a loss.

Firstly, I don't know why anyone would want to use version 1.4 of anything when the current version is 3.5. So I'm not even looking at version 1.4. Version 3.0 was released over two years ago.

I installed "gatk" from the Toolkit, which seems to be version 3. So where do I find the GATK resource bundle? Fasta data from which history can be imported? I don't know what you mean by history. Do you mean the Galaxy history? I don't need to do any preparation in Galaxy, it is all done.

I have a reference genome that I have downloaded and indexed and used with bwa on a cluster. I usually do these things on Linux in a console, writing my own scripts. I'm evaluating Galaxy to see if it can be used for students. So maybe my first question should have been whether it is possible to use Galaxy to create a best-practice GATK pipeline (the latest version of course, or else it is not best practice any more).

If it is not possible then don't even read further, just let me know and I'll need to find another way. If it is possible, then this is what I tried to do. I tried to follow the instructions on https://wiki.galaxyproject.org/Admin/DataIntegration. The first pitfall is a reference to a file called universe_wsgi.xml. There exists no such file in the Galaxy directory structure. Assuming that the default directory for the genome is tool-data/genome, I copied my genome and its indices into this directory under a sub-directory called UCSC. I updated alignseq.loc with the line:

align something hg19js /home/jannetta/galaxy/genome/UCSC

But I can't understand what the comments in the file are trying to say, so the fields don't mean much to me apart from the path, which points to the directory containing the .fa and index files.

I also updated bwa_mem_index.loc with this line:

hg19js hg19js hg19js /home/jannetta/galaxy/tool-data/genome/UCSC/ucsc.hg19.fasta

I'm also not sure what all of that is supposed to mean.

After all that, the genome still didn't show up in the drop-down box when trying to use bwa in Galaxy.

So where do I go from here?

Regards Jannetta

modified 2.7 years ago • written 2.7 years ago by jannetta.steyn • 10

Hello Jannetta,

GATK v 1.4 is the version at http://usegalaxy.org (Galaxy Main) and was the last revision without the modified GATK licencing, phone home, etc. These tools have been deprecated on Main.

On your local or cloud Galaxy, using the updated versions of GATK is the way to go to use the best-practice workflows from the Broad. However, as mentioned, there are no data managers created yet, due to licencing. But these could certainly be created by the community for their own use.

The "create your own indexes" wiki is a bit outdated, true. Data managers are the preferred way to install data for most users now. I updated the wiki for the source config file - thanks for pointing that out - it is now correct: config/galaxy.ini

The GATK "Resource bundle" can be found here: https://www.broadinstitute.org/gatk/download/

Load the fasta file of the target genome into a history. Then use the fetch fasta data manager that permits the creation of a new dbkey, with the source set on the form to use a "fasta file from a history". This would be the GATK-sorted fasta file from the Resource bundle.

From there, the other data managers can be used directly as described above. Be sure to index in the order described (samtools first, picard and 2bit next, then proceed with others that you want to add in). No need to do more manually. This includes creating indexes for BWA-MEM.
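For readers who normally work in a console, the indexing order described above can be sketched as the command-line equivalents below. This is only an illustration of the order, not what the Galaxy data managers run internally; the `picard.jar` location and the output file names are assumptions, and the commands are printed rather than executed.

```python
import shlex
from pathlib import Path


def indexing_commands(fasta, picard_jar="picard.jar"):
    """Return indexing commands as argument lists: samtools, then picard and 2bit, then bwa."""
    fasta = Path(fasta)
    base = fasta.with_suffix("")  # e.g. ucsc.hg19.fasta -> ucsc.hg19
    return [
        # 1. samtools fasta index (.fai)
        ["samtools", "faidx", str(fasta)],
        # 2. picard sequence dictionary (.dict); jar path is an assumption
        ["java", "-jar", picard_jar, "CreateSequenceDictionary",
         f"R={fasta}", f"O={base}.dict"],
        # 3. UCSC 2bit file
        ["faToTwoBit", str(fasta), f"{base}.2bit"],
        # 4. any further indexes, here BWA
        ["bwa", "index", str(fasta)],
    ]


# Dry run: print the commands in order without executing anything.
for cmd in indexing_commands("ucsc.hg19.fasta"):
    print(shlex.join(cmd))
```

Keeping the commands as argument lists means each one could later be passed to `subprocess.run` unchanged if actually running them is wanted.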

Removing partial data is complicated and not recommended. Starting over with a fresh instance is best.

There can be a few reasons why your indexes did not show up after adding the data manually: spaces instead of tabs in the .loc files (the fields must be tab-separated), the server was not restarted, the builds.txt file does not contain the target dbkey (hg19js in your example above), or incorrect content in the .loc file fields. But all of these will resolve if you load the original fasta file from the GATK resource bundle and use the data managers for the other steps.
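Two of these checks are easy to automate. The sketch below flags a .loc entry that is not tab-separated and tests whether a dbkey appears in builds.txt. The four-field layout (unique id, dbkey, display name, path) is assumed from the bwa_mem_index.loc entry quoted earlier in this thread, and builds.txt is assumed to hold the dbkey in its first tab-separated column; the function names are hypothetical.

```python
def check_loc_line(line, expected_fields=4):
    """Return a list of problems found in one .loc entry (empty list if none)."""
    problems = []
    if "\t" not in line:
        problems.append("no tab characters: fields look space-separated")
    fields = line.rstrip("\n").split("\t")
    if len(fields) != expected_fields:
        problems.append(
            f"expected {expected_fields} tab-separated fields, found {len(fields)}"
        )
    return problems


def dbkey_in_builds(dbkey, builds_text):
    """True if dbkey is the first tab-separated column of any non-comment line."""
    for row in builds_text.splitlines():
        if row.strip() and not row.startswith("#"):
            if row.split("\t")[0] == dbkey:
                return True
    return False


path = "/home/jannetta/galaxy/tool-data/genome/UCSC/ucsc.hg19.fasta"
spaced = " ".join(["hg19js", "hg19js", "hg19js", path])   # as quoted in the thread
tabbed = "\t".join(["hg19js", "hg19js", "hg19js", path])  # tab-separated version
builds = "hg19\tHuman Feb. 2009 (GRCh37/hg19) (hg19)\n"   # illustrative builds.txt row

print(check_loc_line(spaced))
print(check_loc_line(tabbed))
print(dbkey_in_builds("hg19js", builds))
```

The space-separated entry is reported twice over (no tabs, wrong field count), while the tab-separated one passes; the custom dbkey hg19js is not found in the illustrative builds.txt row.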

One data manager still on our to-do list would add liftOver files. It did not have a formal ticket, so I created one here: https://github.com/galaxyproject/galaxy/issues/1904. So if you want this data now (it is only available for UCSC-sourced genomes), you will need to add it manually. I would load directly from UCSC and not from the rsync server, as we are also in the process of updating this data and you might not get all or the most current files from that source right now. The wiki does not cover this, but in short: create a folder for the data, rsync the files from the UCSC downloads area, uncompress them, then add each one to the liftover.loc file.
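The last step of that manual procedure can be sketched as follows: given the name of an uncompressed chain file, derive a tab-separated liftover.loc entry. The three-column layout (from-build, to-build, path to chain file) and the `<from>To<To>.over.chain` naming pattern are assumptions based on the UCSC downloads area and the sample .loc file, and the rsync URL is shown only as an illustration of where the files would come from.

```python
import re
from pathlib import PurePosixPath

# Matches names such as hg19ToHg38.over.chain (assumed UCSC naming pattern).
CHAIN_NAME = re.compile(r"^([A-Za-z0-9]+?)To([A-Z][A-Za-z0-9]*)\.over\.chain$")


def rsync_command(dbkey, data_dir):
    """Build (not run) an rsync command for one genome's liftOver files."""
    source = f"rsync://hgdownload.soe.ucsc.edu/goldenPath/{dbkey}/liftOver/"
    return ["rsync", "-avzP", source, f"{data_dir}/"]


def liftover_loc_line(chain_file, data_dir):
    """Return one tab-separated liftover.loc entry for an uncompressed chain file."""
    match = CHAIN_NAME.match(chain_file)
    if match is None:
        raise ValueError(f"unrecognised chain file name: {chain_file}")
    source, target = match.group(1), match.group(2)
    # UCSC capitalises the target build in the file name (Hg38 -> hg38).
    target = target[0].lower() + target[1:]
    path = PurePosixPath(data_dir) / chain_file
    return "\t".join([source, target, str(path)])


print(" ".join(rsync_command("hg19", "/data/liftOver")))
print(liftover_loc_line("hg19ToHg38.over.chain", "/data/liftOver"))
```

Generating the entries from the file names avoids hand-typing tabs, which is where manually edited .loc files tend to go wrong.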

Parts of this question are also at the Biostars.org forum here and some of this reply addresses both: https://www.biostars.org/p/179644

Hopefully this helps you to make your decision, Jen, Galaxy team

modified 2.7 years ago • written 2.7 years ago by Jennifer Hillman Jackson ♦ 25k

jannetta.steyn • 10 wrote 2.6 years ago:

Hi Jen

I am only now getting around to trying what you suggested here. I installed the "fetch fasta data manager that permits the creation of a new dbkey" but I don't know how to use it. Where do I find it? I have logged into my Galaxy instance with admin rights. I then go to Local data, where I find "Create DBKey and Reference Genome", but it doesn't give me the option to use a fasta file from a history.

written 2.6 years ago by jannetta.steyn • 10
