ref genome using Rsync

ChickenRNA • 50 wrote (2.2 years ago):

Hi, I am trying to download the chicken reference genome to my local instance of Galaxy using rsync. So far I am only downloading the allfasta data tables, yet after more than 24 hours the job still says it is running. Is this normal? Is there a faster way to bring the reference genome into a local Galaxy instance for RNA-seq analysis?

Thank you

Jennifer Hillman Jackson ♦ 25k wrote (2.2 years ago):

Hello,

Instead of using the Rsync server, consider installing the genome with Data Managers (sourced from the Tool Shed http://usegalaxy.org/toolshed). Install DMs like any other tool using the Admin functions.

You'll need these DMs at a minimum. Run them first, in this order:

Fasta fetcher. There are two, and often both are needed. If the genome is not listed in the builds list in the Upload tool, use the one that creates a "dbkey".
SAM indexer
Picard indexer
2bit indexer

Then get the DMs that create indexes for the tools you want to use. Run these after the others have completed for the best results.
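
To make the ordering above concrete, here is a minimal Python sketch. It assumes you keep your own list of the data managers you plan to run; the labels are the descriptive names used in this post, not real Tool Shed repository names, and `plan_runs` is a hypothetical helper, not part of Galaxy.

```python
# Prerequisite data managers, in the order given in this post.
# These are descriptive labels only, not Tool Shed repository names.
PREREQUISITE_ORDER = ["fasta fetcher", "sam indexer", "picard indexer", "2bit indexer"]


def plan_runs(selected):
    """Return the selected data managers in a safe execution order.

    Prerequisites come first, in the listed order. Anything else is
    treated as a tool-specific indexer and runs afterwards.
    """
    rank = {name: position for position, name in enumerate(PREREQUISITE_ORDER)}
    last = len(PREREQUISITE_ORDER)
    # sorted() is stable, so tool-specific DMs keep their relative order.
    return sorted(selected, key=lambda name: rank.get(name.lower(), last))


if __name__ == "__main__":
    wanted = ["Bowtie2 indexer", "2bit indexer", "Fasta fetcher",
              "SAM indexer", "Picard indexer"]
    for step, name in enumerate(plan_runs(wanted), start=1):
        print(step, name)
```

Each data manager should still be allowed to finish before the next one is started; the sketch only decides the order.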

Thanks, Jen, Galaxy team

PS: I will look into the rsync server issues in the meantime. Even so, using DMs directly is the best choice.

Thank you so much for your prompt response, Jennifer! Do I need different DMs for different tools, like Tophat? I am trying to set things up like usegalaxy.org, where the reference genome is available in a dropdown in the tools. Or do I have to set them up each time? Are there any tutorials you would suggest? I am fairly new to all this, so thank you for your patience and guidance.

written 2.2 years ago by ChickenRNA • 50

This is the help for Data Managers: https://wiki.galaxyproject.org/Admin/Tools/DataManagers

The idea is to load the genome first. This can be any fasta file, including the same ones used on Galaxy Main, if you wish. The full name indicates the exact build. Find this information in the Upload tool or by clicking the pencil icon for any dataset (the list of genomes is included in both places), or add your own custom genome ("dbkey").

To index for tools, do the first steps (load the fasta, build the basic indexes), then proceed to the tool-specific DMs. At this time, indexing has to be performed per genome. Running this type of processing as a workflow is an enhancement the team is considering to make it all go more smoothly. You could also do the indexing with a script that you write against the Galaxy API: https://wiki.galaxyproject.org/Develop/API.
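
As a rough illustration of the scripted route, here is a minimal sketch that uses only the Python standard library. It assumes that data managers can be submitted like other tools through a POST to `/api/tools` on your instance, authenticated with an admin API key in the `x-api-key` header. The tool id and the `dbkey` input name are placeholders: copy the real tool id and parameter names from the data manager's form on your own Galaxy.

```python
import json
import urllib.request


def build_tool_request(galaxy_url, api_key, tool_id, inputs, history_id=None):
    """Build (but do not send) a POST request that asks Galaxy to run a tool."""
    payload = {"tool_id": tool_id, "inputs": inputs}
    if history_id is not None:
        payload["history_id"] = history_id
    return urllib.request.Request(
        url=galaxy_url.rstrip("/") + "/api/tools",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="POST",
    )


def run_per_genome(galaxy_url, api_key, tool_id, dbkeys, send=False):
    """Prepare one data manager request per genome build.

    With send=False (the default) nothing is contacted, so the requests
    can be inspected first.
    """
    prepared = []
    for dbkey in dbkeys:
        # "dbkey" is a hypothetical input name; real forms differ per DM.
        request = build_tool_request(galaxy_url, api_key, tool_id, {"dbkey": dbkey})
        if send:
            with urllib.request.urlopen(request) as response:
                response.read()
        prepared.append(request)
    return prepared


if __name__ == "__main__":
    for req in run_per_genome("http://localhost:8080", "YOUR_ADMIN_API_KEY",
                              "PLACEHOLDER_DATA_MANAGER_TOOL_ID",
                              ["galGal3", "galGal4"]):
        print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```

Because the basic indexes must exist before the tool-specific ones, a real script would also wait for each job to finish before submitting the next.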

Once the indexes are created, they are persistent data on your instance. In other words, if Tophat indexes are created (with the Bowtie2 DM, using the option "Include Tophat indexes" on the form), they will be available to all users on that instance.

modified 2.2 years ago • written 2.2 years ago by Jennifer Hillman Jackson ♦ 25k

Update: If the chicken genome(s) are named like galGal3, galGal4, etc., they are sourced from UCSC. UCSC is a specific data source choice on the form of the fasta-fetching data manager tool.

If you are unsure about the source of any genome found at http://usegalaxy.org, a web search for the build name will usually locate it, but please feel free to write back and we can help guide you.

modified 2.2 years ago • written 2.2 years ago by Jennifer Hillman Jackson ♦ 25k

Thanks again, Jennifer. Is there a particular way of knowing which indexes specific tools use?

written 2.2 years ago by ChickenRNA • 50

The tool name and the Data Manager are usually named in a way that makes this clear. Is there one you are confused about?

written 2.2 years ago by Jennifer Hillman Jackson ♦ 25k

No, I am just in the process of installing all the DMs that you mentioned. I am confused about setting up the indexes, but it may become clear once I get going.

written 2.2 years ago by ChickenRNA • 50

Jennifer, does installing these DMs also take longer than installing tools from the Tool Shed? It is taking much longer in my case, so I am not sure whether this is normal or something is wrong on my end. Thanks.

written 2.2 years ago by ChickenRNA • 50

Not that I have ever noticed, but different tools have different dependencies. The overall load on the Tool Shed can also be a factor as well as how many tools are loaded simultaneously. I suggest starting the install for those you want and then allowing them to complete. Once done, check the status for each to ensure all went as expected.

Update: I re-read your post. Do you mean that the actual jobs performing the indexing are long-running? If so, they will consume about the same resources (memory, time, compute) as running the indexing on the command line. Some indexes do take time. If any fail for lack of resources, there were probably not enough available to run the single job or (sometimes) the concurrent jobs. Just re-run those. I have had a few genomes consume a very large amount of memory, but these were generally very large and/or highly fragmented genomes. Adjusting the parameters can usually help, as can providing more resources (most often memory). If the parameters are unclear, examine the target tool's documentation. The manual/help should describe how indexes are best created for particular genome build types, along with the recommended resources.
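
For a sense of what the equivalent command-line indexing looks like, here is a small Python sketch that builds the commands for the basic indexes plus a Bowtie2 index. It assumes `samtools`, `faToTwoBit`, `bowtie2-build`, and a `picard` wrapper script are on your PATH (with a plain jar install the Picard step would be invoked through `java -jar` instead). The file name `galGal4.fa` is only an example, and the exact commands the data managers run internally may differ.

```python
import shutil
import subprocess
from pathlib import Path


def index_commands(fasta):
    """Return the indexing commands for one fasta file, as argument lists."""
    fasta = Path(fasta)
    stem = fasta.with_suffix("")  # galGal4.fa -> galGal4
    return [
        ["samtools", "faidx", str(fasta)],                       # writes <fasta>.fai
        ["picard", "CreateSequenceDictionary",
         f"R={fasta}", f"O={stem}.dict"],                        # sequence dictionary
        ["faToTwoBit", str(fasta), f"{stem}.2bit"],              # 2bit file
        ["bowtie2-build", str(fasta), str(stem)],                # Bowtie2 index files
    ]


def run_available(commands, dry_run=True):
    """Run each command whose program is installed; report the rest."""
    for command in commands:
        if shutil.which(command[0]) is None:
            print("skipping (not on PATH):", " ".join(command))
            continue
        print("running:" if not dry_run else "would run:", " ".join(command))
        if not dry_run:
            subprocess.run(command, check=True)


if __name__ == "__main__":
    run_available(index_commands("galGal4.fa"), dry_run=True)
```

Timing one of these commands directly on the same machine gives a baseline for how long the matching data manager job can be expected to take.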

modified 2.2 years ago • written 2.2 years ago by Jennifer Hillman Jackson ♦ 25k

Hi Jennifer, my problem was that it was taking a long time to install the DMs from the Tool Shed. I will follow your advice, stop them all, and install them one by one.

written 2.2 years ago by ChickenRNA • 50

You shouldn't need to install the tools one at a time. Just allow them to complete, and delete/reinstall any that fail.
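
If you prefer to check the results from a script rather than the Admin pages, here is a minimal sketch. It assumes your instance lists installed repositories at `/api/tool_shed_repositories` as JSON objects with `name` and `status` fields, and that a finished install reports the status `Installed`; verify both against your Galaxy version. The helper names are hypothetical.

```python
import json
import urllib.request


def fetch_repositories(galaxy_url, api_key):
    """Fetch the list of Tool Shed repositories known to a Galaxy instance."""
    request = urllib.request.Request(
        galaxy_url.rstrip("/") + "/api/tool_shed_repositories",
        headers={"x-api-key": api_key},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def not_installed(repositories):
    """Return (name, status) pairs for repositories that have not finished installing."""
    return [
        (repo.get("name"), repo.get("status"))
        for repo in repositories
        if repo.get("status") != "Installed"
    ]


if __name__ == "__main__":
    # Sample data in the assumed shape; replace with fetch_repositories(...)
    # and an admin API key to query a real instance.
    sample = [
        {"name": "example_dm_one", "status": "Installed"},
        {"name": "example_dm_two", "status": "Error"},
    ]
    for name, status in not_installed(sample):
        print(name, status)
```

Anything this reports after the installs have had time to finish is a candidate for the delete/reinstall step.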

written 2.2 years ago by Jennifer Hillman Jackson ♦ 25k

Jennifer, sorry about all the basic questions, but I am very new to all this. I have my local Galaxy running in a VM on Windows (I was going to dual boot into Linux to use Galaxy, but it worked in the VM). I had selected all the DMs that you mentioned to install using my toolshed, and it was taking several hours (when I download tools it usually doesn't take that long). So I am sure there is something wrong, but I am lost as to how to identify it in order to fix it. Thank you so much for all your help. I am learning loads, but the learning curve is steep.

written 2.2 years ago by ChickenRNA • 50

I am not clear on what you mean by "my toolshed". Could you explain more if the rest of this does not help?

The DMs should be installed from the Main Tool Shed (hosted at http://usegalaxy.org/toolshed). This is the default Tool Shed accessed through the Admin install tools function on a local/cloud Galaxy obtained from http://getgalaxy.org (within a VM or not; Galaxy is not supported on Windows directly). Doing this ensures that you get the most current version of the DM that works with the most current Galaxy release, so that it installs and runs correctly.

written 2.2 years ago by Jennifer Hillman Jackson ♦ 25k

By "my toolshed" I meant downloading the DMs through the Tool Shed interface on my local instance of Galaxy. When I install them and go to "monitor installing tool shed repositories", the status for the DMs stays at "cloning" or "installing dependent repositories" (the exact wording may not be right) for several hours (>24 hours).

Would you suggest dual booting my computer and running this on Linux? Could this be caused by running it in a VM?

written 2.2 years ago by ChickenRNA • 50

I am not sure what is going on, but I have asked our team for input to help troubleshoot. Typically, when tools take this long to load, an uninstall/reinstall can help, but that might be another dead end considering the Windows/VM factors. More feedback soon. - Jen

PS: Using a Docker image (as asked in your other post here: https://biostar.usegalaxy.org/p/19402/) is one solution to try until that happens, and it may be what the team recommends anyway.

written 2.2 years ago by Jennifer Hillman Jackson ♦ 25k

Thank you, Jennifer. I am eagerly waiting to hear what the team says about this issue. If running through a VM is the problem, I will dual boot into Linux to run this.

modified 2.2 years ago • written 2.2 years ago by ChickenRNA • 50
