Loading human genome hg38 reference sequence- completely stuck

12 months ago by

hmb106 • 10 wrote:

I am running a local version of Galaxy (17.09), trying to analyze RNA-seq data with Bowtie2 and DESeq2. I am the admin and have gotten this instance mostly configured. I have managed to get all the sequences loaded into the Data libraries off my hard drive, as I haven't had time to set up an FTP server. I have been trying like crazy to get the ref genomes and GFF/GTF files loaded to start my analyses. I have been all over the wiki and here on Biostars looking at how to load a reference genome for hg38. I watched the video "managing galaxy's built in data and data managers." as well as looked at several very similar questions. I used the tool shed to obtain DMs (create DBkey and ref genome) and followed the directions in the video, tried to build Bowtie2 indexes, and got an error. I realized then that I might need to build some additional indexes first, as per several threads. So I got the tools for Samtools indexes, Picard indexes, and twoBit indexes. When I attempted to run the Samtools index from the data manager, I still got the following error telling me it can't find a file:

Fatal error: Exit code 1 ()
Traceback (most recent call last):
  File "/Users/broxn8/shed_tools/toolshed.g2.bx.psu.edu/repos/devteam/data_manager_sam_fasta_index_builder/1865e693d8b2/data_manager_sam_fasta_index_builder/data_manager/data_manager_sam_fasta_index_builder.py", line 92, in <module>
    main()
  File "/Users/broxn8/shed_tools/toolshed.g2.bx.psu.edu/repos/devteam/data_manager_sam_fasta_index_builder/1865e693d8b2/data_manager_sam_fasta_index_builder/data_manager/data_manager_sam_fasta_index_builder.py", line 85, in main
    build_sam_index( data_manager_dict, options.fasta_filename, target_directory, options.fasta_dbkey, sequence_id, sequence_name, data_table_name=options.data_table_name or DEFAULT_DATA_TABLE_NAME )
  File "/Users/broxn8/shed_tools/toolshed.g2.bx.psu.edu/repos/devteam/data_manager_sam_fasta_index_builder/1865e693d8b2/data_manager_sam_fasta_index_builder/data_manager/data_manager_sam_fasta_index_builder.py", line 40, in build_sam_index
    proc = subprocess.Popen( args=args, shell=False, cwd=target_directory, stderr=tmp_stderr.fileno() )
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 710, in __init__
    errread, errwrite)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 1335, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory

So now I am totally confused about how to resolve this, and I am not seeing a clear answer in the Biostar responses, or maybe I missed it?

I'm just learning command line codes and thought using Galaxy to run my analyses would be easy, especially since it's supposed to be for those without extensive computational experience. I have too much data to use the main instance, which was fairly easy to use. It didn't seem like it should be that hard to set up a local instance, but I have to say, these ref genome instructions are seemingly not detailed enough for some of us. As this seems to be a recurring question, perhaps it's time to write up a more detailed version of the instructions, and by detailed I mean so that a child could do it. I'm sure it's something simple but I just can't figure it out! If anyone can help me figure this out, I'd be most appreciative!!! Best, Heather

rna-seq ref genome bowtie hg38 • 622 views

modified 12 months ago • written 12 months ago by hmb106 • 10

12 months ago by

Jennifer Hillman Jackson ♦ 25k

United States

Jennifer Hillman Jackson ♦ 25k wrote:

Hello,

You don't need to upload the fasta version of the genome yourself for hg38. A fasta genome can be used from the history with the "fetch" Data manager, but that should only be for cases where the genome is not already available from a source known to the tool.

DMs should be run in this order:

Fetch genome and (optionally) create dbkey. hg38 should already be in your list of databases if you obtained the latest Galaxy version 17.09 from github following the https://getgalaxy.org instructions (this is just a list - no data is behind the databases until you install it with a data manager).

Then create these indexes for that same dbkey (existing or new), in this order. I would recommend doing this for all genomes you plan to natively index.

Sam indexes
Picard indexes
2bit indexes

From here, use DMs to install any other indexes that you want to use. The order is not important anymore. For example:

Bowtie2/Tophat2 indexes (created with the same DM)
HISAT2 indexes
Any others
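
The ordering above can be sketched as a small prerequisite table. This is only an illustration in plain Python, not Galaxy code: the step names and the `first_out_of_order` helper are hypothetical, and the table simply assumes the order given in this reply (fetch, then Sam, Picard, and 2bit in that order, then any other indexes).

```python
# Hypothetical step names; the ordering mirrors the reply above and is
# an assumption of this sketch, not something read from Galaxy itself.
PREREQUISITES = {
    "fetch_genome": [],
    "sam_index": ["fetch_genome"],
    "picard_index": ["sam_index"],
    "twobit_index": ["picard_index"],
    # After the 2bit index, the remaining indexes can be built in any order.
    "bowtie2_index": ["twobit_index"],
    "hisat2_index": ["twobit_index"],
}


def first_out_of_order(planned_steps):
    """Return the first step planned before one of its prerequisites, or None.

    Unknown step names raise KeyError.
    """
    done = set()
    for step in planned_steps:
        missing = [p for p in PREREQUISITES[step] if p not in done]
        if missing:
            return step
        done.add(step)
    return None


if __name__ == "__main__":
    plan = ["fetch_genome", "bowtie2_index"]
    print("First out-of-order step:", first_out_of_order(plan))
```

Checking a planned sequence this way before running anything can be useful because, as noted further down, data manager changes are hard to undo once made.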

Reference annotation datasets (GTF, GFF3) are good to place in a data library. When you want to use the reference GTF/GFF3 data with tools, copy these into the working history and select the target files from that history on the tool forms.

Warning: because there may now be dbkey conflicts in your database, and these are difficult to remove, you have two choices: 1) start over with a fresh instance and install all genomes this way, or 2) create a new dbkey during the fetch step. With option 2, your target is still UCSC and the "hg38" genome, but locally you will be giving it some other name; otherwise you could end up with duplicated indexes associated with the hg38 dbkey (database name).

If you choose to do 2) - then you'll miss out on being able to use direct links to UCSC or IGV/B for visualization. This makes the "start over" option better. We have open enhancement issues to make it easier to "undo" data manager changes that are not valid, but those are still a work-in-progress. For now, the indexes need to be created this way, in this order, without any mistakes, to be successfully installed.

A new set of tutorials for data managers is a good idea; however, a local Galaxy is not really intended for use by those without admin support. For a robust local Galaxy, some command-line/admin experience is needed for the advanced configuration (https://github.com/galaxyproject/dagobah-training). For this reason, for scientific users, a Cloudman Galaxy is often a better choice (AWS offers grants in education that can help cover costs), or a Docker Galaxy if your server/computer has the compute resources that tools will need. The resources that tools need do not change when run within Galaxy (local or Docker), so if you are targeting a specific tool, review its memory and other requirements first. Some tools require a lot of memory to run (example: RNA STAR). This is also what makes Cloudman a good choice, because you can pick master/node size/resources based on needs. Review all choices here: https://galaxyproject.github.io/

Thanks! Jen, Galaxy team

ADD COMMENT • link modified 12 months ago • written 12 months ago by Jennifer Hillman Jackson ♦ 25k

12 months ago by

hmb106 • 10 wrote:

Thanks for the prompt and detailed answer. I will give this a try today with a clean install, as I bet the DBKey duplicates may be the issue. Unfortunately, I don't have the bandwidth/access here to properly utilize a Cloudman instance for the amount of data I am analyzing (250GB of transcriptome FASTQ files). I would run Docker Galaxy on the institute cluster, but it's a bit of a mess too at the moment, hence the use of a local instance. With this particular project, I have become acutely aware that my institute doesn't have all the computational resources in place to handle big data projects. I am currently working to figure out how to utilize other resources, but I still have to get the data analyzed in the process. As for admin support, it hasn't been too bad getting my instance set up and configured. I'm not a computational pro, but not a complete novice either; I work more in R than Python. The only part confounding me was the DM and genome access. I'll be the only one using this instance, so I'm ok with doing my own admin support until I can figure out how to get access to Cloudman. Thanks again for the help and detailed answer. I'll let you know how it goes. Best, Heather

ADD COMMENT • link written 12 months ago by hmb106 • 10

12 months ago by

hmb106 • 10 wrote:

Jen or anyone else,

I tested the instructions above to load hg38 (new install) and got through the twoBit indexes on my laptop (OS 10.10.5 with Python 2.7.10). I'm now hung up on generating the Bowtie indexes. How long should it take, and do I need samtools installed to accomplish this? It just hangs on the laptop for hours. I know my laptop isn't quite the computational powerhouse, but the other index files didn't take that long to make.

More importantly, I'm trying to mimic this on my Mac desktop where I had the original issue (OS 10.12.6 with Python 2.7.10), which is where I ideally want to run the local instance of Galaxy. However, I cannot for the life of me get it to work. I have checked all the files and am still getting the above error message. There are no differences in the Galaxy files as far as I can see. The only differences are the OS and that I installed the laptop DMs at home on my network and made all the indexes that worked at home. The desktop is on the hospital network behind a firewall. I noticed some hang time while installing the tools from the tool shed, but didn't think it would affect anything else, just the download times. Would this affect the ability of the DMs to work? From the tutorials and diagrams, I assumed this would/could all be done offline and not require a network or access to process the downloaded genome.fa file. Have I misunderstood how the indexes are generated? I can fetch the genome with the CreateDBkey/Ref Genome DM, and looking over the file, there are no issues and it is all there. But the sam index DM won't run and still gives me the above error that it can't find the file. I cannot figure out where the problem is on the desktop.

I tried to get the Bowtie2 indexes from iGenomes for UCSC hg38 as a workaround. I tried to load those into Galaxy using the data libraries, with the idea to load them into my history and use them from there, but I can't get those uploaded using the admin account to wrangle the data libraries. I double-checked those files too; they are all there and all ok. They seem to start loading, then quit and act as if done, although no data is imported. I can load my data just fine, although it takes a while, but indexes or genome, nope, not on the desktop. I haven't tried the laptop yet. Bench work was calling. sigh.

Also, if I happen to get this all wrangled on the laptop, is it possible to just copy the Galaxy folder in its entirety over to the desktop? I was told this wouldn't be that hard to install (for someone with my skill level) for running Bowtie2/DESeq2 rather than using the main instance or cloud (given the firewall issues I have).

As always, many thanks to whoever answers this!! Best, Heather

ADD COMMENT • link modified 12 months ago • written 12 months ago by hmb106 • 10
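
A note on the traceback in the question: its last frames are inside `subprocess.Popen` (`_execute_child`). With `shell=False`, an `OSError: [Errno 2]` raised there generally means that the program being launched, or the working directory passed as `cwd`, could not be found, rather than that the genome FASTA is missing. The script name suggests the program is `samtools`, but that is an assumption, since the traceback does not show the contents of `args`. The Python sketch below is one way to check both possibilities from the account that runs Galaxy. `find_on_path` and `try_launch` are hypothetical helper names, not part of Galaxy or the data manager.

```python
import errno
import os
import subprocess
import sys
import tempfile


def find_on_path(program):
    """Return the full path of an executable named `program` on PATH, else None."""
    for directory in os.environ.get("PATH", "").split(os.pathsep):
        candidate = os.path.join(directory, program)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


def try_launch(args, cwd):
    """Launch `args` with shell=False, as in the traceback.

    Returns "started" if the process could be created, or "not found" if
    Popen raised OSError with errno ENOENT (missing program or missing cwd).
    """
    try:
        proc = subprocess.Popen(args, shell=False, cwd=cwd,
                                stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        proc.communicate()
        return "started"
    except OSError as exc:
        if exc.errno == errno.ENOENT:
            return "not found"
        raise


if __name__ == "__main__":
    workdir = tempfile.gettempdir()
    # Assumption: the data manager shells out to a program called "samtools".
    print("samtools on PATH:", find_on_path("samtools"))
    print("launching samtools:", try_launch(["samtools"], cwd=workdir))
    # Control: the Python interpreter itself should always start.
    print("launching python:", try_launch([sys.executable, "-c", "pass"], cwd=workdir))
```

If the first line prints `None` on the desktop but a path on the laptop, that difference would be worth investigating before anything else. Note that the PATH seen by a running Galaxy process can differ from the PATH in an interactive terminal.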
