Question: BLASTing on the cloud
4.3 years ago by
mjmv0
United States
mjmv0 wrote:

I have not yet run either a stand-alone or a cloud-based version of Galaxy. However, I have a reference file that I wish to BLAST some sRNA libraries against, and I know from experience that this takes a long time under normal operating procedure.

Does the package for cloud-based Galaxy come standard with BLAST+? If so, can it be parallelized, as the Desktop package can? And if so, how many nodes can be supported?

Also, the documentation for the tool mentions a 'blast_datatypes' repository and references the NCBI nucleotide/protein databases. This seems to refer to the most up-to-date collection of sequences that one can download from NCBI, with the download left to the user. If that is the case, I don't see the need for anything beyond BLAST+ (and the wrappers for running it on Galaxy): one would treat these like any other set of sequences to build a database from with the 'makeblastdb' command, no? At a later point I will likely want to BLAST against the most recent nucleotide collection.

 

Thanks for your assistance!

M

galaxy • 1.2k views
ADD COMMENTlink modified 4.3 years ago by Peter Cock1.4k • written 4.3 years ago by mjmv0
4.3 years ago by
Peter Cock1.4k
European Union
Peter Cock1.4k wrote:

You are asking many questions at once - this Q&A style site works better with one question at a time ;)

Regarding BLAST databases, the BLAST+ wrappers include makeblastdb so that users can upload a FASTA file and make their own BLAST database which appears in their Galaxy History as a new dataset.
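For anyone who wants to see what the makeblastdb step amounts to outside Galaxy, here is a minimal sketch that only builds the command line and does not run it. The file name "reference.fasta", the database name "reference_db" and the helper name makeblastdb_command are made up for illustration; the -in, -dbtype, -out and -title options are standard makeblastdb options, but the Galaxy wrapper may pass additional ones.

```python
import shlex


def makeblastdb_command(fasta_path, db_name, dbtype="nucl", title=None):
    """Build the argument list for NCBI BLAST+ makeblastdb (does not run it).

    fasta_path and db_name are hypothetical example values supplied by the caller.
    """
    if dbtype not in ("nucl", "prot"):
        raise ValueError("dbtype must be 'nucl' or 'prot'")
    cmd = ["makeblastdb", "-in", fasta_path, "-dbtype", dbtype, "-out", db_name]
    if title is not None:
        cmd += ["-title", title]
    return cmd


# Example: a nucleotide database built from a reference FASTA file.
cmd = makeblastdb_command("reference.fasta", "reference_db", title="My reference")
print(shlex.join(cmd))
```

To actually build the database you would hand the list to subprocess.run(cmd, check=True) on a machine where BLAST+ is installed.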

However, there are a number of commonly used BLAST databases which are best pre-installed by the Galaxy Administrator to avoid duplication. We have copies of the big NCBI databases like NR and NT, plus lots of organism-specific databases of interest to the researchers using our Galaxy. These are set up via the blastdb.loc and blastdb_p.loc files.
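As a rough illustration of what an entry in those .loc files looks like, the sketch below writes and reads lines in a three-column, tab-separated layout (unique ID, caption shown to users, path to the database base name). That layout is an assumption based on the sample file shipped with the wrappers, so check the blastdb.loc.sample file of your installed version; the ID, caption and path in the example are invented.

```python
def format_loc_line(unique_id, caption, base_path):
    """Return one tab-separated .loc entry (assumed three-column layout)."""
    fields = (unique_id, caption, base_path)
    if any("\t" in f or "\n" in f for f in fields):
        raise ValueError("fields must not contain tabs or newlines")
    return "\t".join(fields)


def parse_loc(text):
    """Parse .loc text into (id, caption, path) tuples, skipping comments and blanks."""
    entries = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError("expected 3 tab-separated fields: %r" % line)
        entries.append(tuple(fields))
    return entries


# Hypothetical entry for a locally installed copy of NT.
example = "# id, caption, path\n" + format_loc_line(
    "nt_local", "NCBI NT (local copy)", "/data/blastdb/nt"
)
print(parse_loc(example))
```

Note that the path is the database base name (no file extension), as you would pass it to the -db option on the command line.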

ADD COMMENTlink modified 4.3 years ago • written 4.3 years ago by Peter Cock1.4k
4.3 years ago by
United States
Jennifer Hillman Jackson25k wrote:

Hello,

A CloudMan Galaxy does not come with BLAST+ (or Megablast) already installed, but obtaining these tools is fairly simple through the administration interface. The BLAST+ repository in the Tool Shed is named "ncbi_blast_plus". The documentation explains set-up, including databases. Megablast is named "megablast_wrapper". The "blast_datatypes" repository is supplemental, as are other additions and variations. For target databases, you can obtain the indexes that are already available from NCBI as downloads and use those. Standard BLAST indexes created using command-line index creation (or the Tool Shed wrapper for this) are also fine.

I believe that BLAST+ will distribute across 8 nodes, but let's have the developers who worked on the package correct me on the current capabilities and/or necessary configuration to achieve that (will likely post an additional reply, but if I don't see one soon - I'll ask for clarification directly).

Links to get started (it sounds like you have already seen these, but I'll post for others interested):
http://usegalaxy.org/cloud
http://usegalaxy.org/toolshed
http://toolshed.g2.bx.psu.edu/

And if you wish to see how we configure database indexes for Megablast on the public Main instance (or to use them, or model your own after them), these are available via our rsync server. Follow the "UseGalaxy.org Rsync" link (in wiki below): see the top-level "blastdb" directory for data, exact processing instructions, and sample tests. Again at the top level, the "location" directory contains the "blastdb.loc" file used by the tool.

For more general instructions about creating indexes on the command line, follow the "Data Preparation" link (also in wiki below). Megablast indexes are explained closer to the end of that page.

Data Integration hub:
http://wiki.galaxyproject.org/Admin/DataIntegration

Best wishes for your project, Jen, Galaxy team

 

ADD COMMENTlink written 4.3 years ago by Jennifer Hillman Jackson25k

Was it agreed that 8 nodes is correct?

When installed, will I have a dropdown to select the number of cores I will use? I don't want to select an AWS configuration that exceeds capacity, as I will be charged more for my usage time.

Also...on the main instance for Galaxy...how many cores and RAM are used for BLASTs? I can use this information to estimate the run time and cost for more cores and/or more RAM.

Thanks! M


From: Jennifer Hillman Jackson on Galaxy Biostar [notifications@biostars.org] Sent: Monday, August 04, 2014 9:40 PM To: mjmv Subject: [galaxy-biostar] A: BLASTing on the cloud


ADD REPLYlink written 4.3 years ago by mjmv0

BLAST+ isn't installed on the main Galaxy at http://usegalaxy.org, but using 8 cores for each BLAST job (via the $GALAXY_SLOTS setting) is sensible. We use only four cores for BLAST+ on our local Galaxy, but this is in order to use some of the older nodes on our cluster; see http://www.slideshare.net/pjacock/galaxy-blast-gcc2014 and the video http://jhupilot.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=2357e7e4-3582-4d07-a2fe-4bd0da28fc49 of my talk at GCC2014.

Note that the number of cores is controlled by the Galaxy Administrator, not the Galaxy user.
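To make the $GALAXY_SLOTS point concrete, here is a small sketch of how a job could turn that environment variable into the BLAST+ -num_threads option, falling back to 8 when it is unset. This is not the wrapper's actual code: the helper names, the fallback handling and the example file names are assumptions for illustration, while -query, -db, -outfmt, -out and -num_threads are standard blastn options.

```python
import os


def blast_threads(default=8, environ=None):
    """Return the thread count from GALAXY_SLOTS, or `default` if unset or invalid."""
    env = os.environ if environ is None else environ
    try:
        slots = int(env.get("GALAXY_SLOTS", ""))
    except ValueError:
        return default
    return slots if slots > 0 else default


def blastn_command(query, db, out, environ=None):
    """Build a blastn argument list (tabular output) without running it."""
    return [
        "blastn",
        "-query", query,
        "-db", db,
        "-outfmt", "6",
        "-out", out,
        "-num_threads", str(blast_threads(environ=environ)),
    ]


# Hypothetical file names; the admin-configured slot count is simulated here.
print(blastn_command("srna.fasta", "reference_db", "hits.tsv",
                     environ={"GALAXY_SLOTS": "4"}))
```

Because the value comes from the job environment set up by the administrator, there is nothing for the user to choose in the tool form.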

ADD REPLYlink modified 4.3 years ago • written 4.3 years ago by Peter Cock1.4k