Question: A server for conservation genomics
tiagoantao30 wrote, 15 months ago (United States):

We are currently trying to design a server/cluster for a public conservation genomics galaxy server.

Our initial expectation (guesstimate) is around 200-300 users, mostly working with RAD-Seq data (which tends to have much smaller data sizes than WGS). The workload mostly involves BWA, FreeBayes, and Stacks, but at levels of usage less intensive than WGS, exome capture, etc.

I know the description above is sparse, but could someone make hardware suggestions for this setup?


Bjoern Gruening wrote, 15 months ago:

Hi Tiago,

such questions are really hard to answer: you can run such a service with just 100 cores and people simply wait a little longer, or you can easily add a few hundred more to reduce the wait time. I would say a BWA or BWA-MEM run should finish in a few hours given 4 cores; at 20 concurrent users that is 4 cores × 20 users = 80 cores, which should be enough. In my experience people are happy to wait when needed and will schedule their jobs overnight or over weekends, so keep the queue busy and plan around that. Keep in mind that mapping needs memory, so the mapping nodes should have enough (>24 GB). A few smaller nodes are also good for text-processing jobs or Stacks.
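The core estimate in this paragraph can be written as a tiny back-of-envelope calculation; the numbers (4 cores per job, ~20 concurrent users) are the illustrative assumptions above, not measurements:

```python
def cores_needed(cores_per_job: int, concurrent_users: int) -> int:
    """Rough compute sizing: cores reserved per mapping job times
    the number of users expected to be mapping at the same time."""
    return cores_per_job * concurrent_users

# 4 cores per BWA-MEM job, ~20 users mapping concurrently (assumed above)
print(cores_needed(4, 20))  # 80
```

If users instead queue their jobs overnight, the same cluster serves far more people per day, which is why "keep the queue busy" is the cheaper strategy.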

Galaxy itself, on the other hand, does not need many resources, so this part is negotiable.

Storage is another beast. Do you want to keep data for years? Are you fine with users deleting histories after a project is finished? What is your archival strategy? If you communicate this well, people can live with a 250 GB quota and delete intermediate results, since they can reproduce them with Galaxy at any time. So for 200 users, 20-30 TB should be enough, but maybe buy storage in a way that lets you extend it easily.
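The quota-based storage estimate can be sketched the same way. The 200-user count and 250 GB quota come from the paragraph above; the average fill rate is a made-up assumption, shown only to illustrate why 20-30 TB is plausible even though the raw quota total is 50 TB:

```python
def storage_tb(users: int, quota_tb: float, avg_fill: float) -> float:
    """Expected working-set size: per-user quota scaled by the
    fraction of it the average user actually fills."""
    return users * quota_tb * avg_fill

# 200 users, 250 GB (0.25 TB) quota; assume users fill ~50% on average
print(storage_tb(200, 0.25, 0.5))  # 25.0 TB, inside the 20-30 TB range
```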

Not sure this helps; it's a complicated topic :(



The setup we currently have is based on a single very powerful machine, and if possible I would like to keep it that way, but I wonder whether that could become a serious bottleneck. Say a machine with 6 TB of memory and 128 cores; managing a single machine is much simpler than managing a cluster.

Disk space is a problem, but I think we can budget 1 PB for that.

written 15 months ago by tiagoantao30

This machine feels like a great start for your purpose.

written 15 months ago by Martin Čech
Martin Čech wrote, 15 months ago (United States):

Galaxy itself is not resource hungry: unless many people use the interface concurrently, the vast majority of your needs will come from running jobs. So here is an important question:

How many jobs do you expect, and how long do they usually take on what hardware? That should let us do some basic math and proceed.

Also do you have some hardware already at your disposal?


We already have an internal server with far fewer users than this service would have, but with a much broader scope.

In this case I suspect the heavy tools would be BWA + GATK/FreeBayes + Stacks. Other tools (more on the downstream analysis side) would also be available, but they are probably less intensive.

So I would say something on the order of the computational cost of a low-coverage exome capture per user (mostly the cost of BWA + FreeBayes). The number of users is quite difficult to estimate, but let's put it on the order of 100-200. I suspect peak usage will be low (around 10 concurrent users), but disk usage might be high (at least 1 TB per user).
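These disk figures can be checked against the 1 PB budget mentioned earlier in the thread; a quick sketch using only numbers from the discussion:

```python
def total_disk_tb(users: int, tb_per_user: float) -> float:
    """Total disk if every user consumes their full per-user footprint."""
    return users * tb_per_user

# 100-200 users at >= 1 TB each
low, high = total_disk_tb(100, 1.0), total_disk_tb(200, 1.0)
print(low, high)     # 100.0 200.0 TB
print(high <= 1000)  # True: well within a 1 PB (1000 TB) budget
```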


written 15 months ago by tiagoantao30
Powered by Biostar version 16.09