A server for conservation genomics


15 months ago by

United States

tiagoantao • 30 wrote:

We are currently trying to design a server/cluster for a public conservation genomics galaxy server.

Our initial expectation (a guesstimate) is around 200-300 users, mostly working with RAD-Seq data (which tends to be much smaller than WGS data). This mostly involves bwa, freebayes and stacks, but at usage levels that are less intensive than with WGS, exome capture, etc.

While the above description is sparse, I was wondering if someone could suggest hardware for this?

Thanks

public admin hardware galaxy rad-seq • 517 views


modified 15 months ago by Bjoern Gruening ♦ 5.1k • written 15 months ago by tiagoantao • 30

15 months ago by

Bjoern Gruening ♦ 5.1k

Germany

Bjoern Gruening ♦ 5.1k wrote:

Hi Tiago,

Such questions are really hard to answer: you can run such a service with just 100 cores, in which case people need to wait a little longer, or you can add a few hundred more to reduce the wait time. I would say BWA or BWA-MEM should finish in a few hours given 4 cores. 4 cores x 20 users = 80 cores. This should be enough. In my experience people are happy to wait if needed, and they schedule their jobs overnight or over weekends. So keep the queue busy and plan with that in mind. Keep in mind that mapping needs some memory, so these mapping nodes should have more than 24 GB. A few smaller ones are also good for text-processing jobs or stacks.
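The arithmetic above can be sketched in a few lines of Python. This is only a back-of-the-envelope sketch: the function names are hypothetical, and it assumes every concurrent user runs exactly one mapping job with a fixed core count at a time.

```python
import math


def cores_needed(concurrent_users: int, cores_per_job: int = 4) -> int:
    """Cores required if each concurrent user runs one mapping job."""
    return concurrent_users * cores_per_job


def queue_waves(jobs: int, total_cores: int, cores_per_job: int = 4) -> int:
    """Number of back-to-back batches needed to work through `jobs`.

    Assumes all jobs take about the same time (a simplification).
    """
    slots = total_cores // cores_per_job  # jobs that can run side by side
    return math.ceil(jobs / slots)


if __name__ == "__main__":
    print(cores_needed(20))     # 4 cores x 20 users
    print(queue_waves(50, 80))  # 50 queued jobs on an 80-core pool
```

With more queued jobs than slots, the number of waves multiplied by the "few hours" per mapping job gives a rough feel for how long the last user in the queue would wait.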

Galaxy itself, on the other hand, does not need many resources, so this is negotiable.

Storage is another beast. Do you want to keep data for years? Are you fine with your users deleting histories after a project is finished? What is your archive strategy? If you communicate this well, people can deal with a 250 GB quota and delete intermediate results, as they can reproduce them with Galaxy at any time. So for 200 users, 20-30 TB should be enough, but maybe buy it in a way that lets you extend it easily.

Not sure this helps; it's a complicated topic :(
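To make the storage estimate concrete: 200 users each filling a 250 GB quota would come to 50 TB, so 20-30 TB implies that quotas are only partly used on average. The sketch below makes that explicit. The `fill_fraction` parameter and the function name are assumptions introduced for illustration; they are not figures from the answer.

```python
def storage_tb(users: int, quota_gb: float, fill_fraction: float) -> float:
    """Estimated storage in TB (1 TB = 1000 GB).

    fill_fraction is the assumed average share of the quota in use.
    """
    return users * quota_gb * fill_fraction / 1000


if __name__ == "__main__":
    # Try a few assumed average fill levels for 200 users at 250 GB each.
    for fraction in (0.4, 0.6, 1.0):
        print(fraction, storage_tb(200, 250, fraction))
```

Under these assumptions, the 20-30 TB range corresponds to users filling roughly 40-60% of their quota on average, which is one reason to buy storage that can be extended later.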

Bjoern

written 15 months ago by Bjoern Gruening ♦ 5.1k

The setup that we currently have is based on a single very good machine, and if possible I would like to keep it that way. But I am wondering whether that could be a serious bottleneck. Say, a machine with 6 TB of memory and 128 cores. Managing a single machine is much simpler than managing a cluster.

Disk space is a problem but I think we can budget 1 PB for that.

written 15 months ago by tiagoantao • 30

This machine feels like a great start for your purpose.

written 15 months ago by Martin Čech ♦♦ 4.9k

15 months ago by

Martin Čech ♦♦ 4.9k

United States

Martin Čech ♦♦ 4.9k wrote:

Galaxy itself is not resource hungry: unless many people are using the interface concurrently, the vast majority of your needs will come from running jobs. So here is an important question:

How many jobs do you expect and how long do they usually take on what hardware? This should enable us to do some basic math and proceed.
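One way to do that basic math is to convert the job mix into an average number of busy cores. The sketch below is a rough illustration only: the function name is hypothetical and the example inputs are placeholders, not measurements from this thread.

```python
def average_cores_busy(jobs_per_day: float,
                       hours_per_job: float,
                       cores_per_job: int) -> float:
    """Average number of cores kept busy around the clock.

    core-hours per day = jobs * hours * cores; dividing by 24 hours
    gives the steady-state core count needed to keep up with demand.
    """
    return jobs_per_day * hours_per_job * cores_per_job / 24


if __name__ == "__main__":
    # Placeholder workload: 60 jobs per day, 3 hours each, 4 cores each.
    print(average_cores_busy(60, 3, 4))
```

This is an average, so peak demand can be well above it; the gap between the two is what the job queue absorbs.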

Also do you have some hardware already at your disposal?

written 15 months ago by Martin Čech ♦♦ 4.9k

We already have an internal server with far fewer users than we would have for this one, but with a much broader scope.

In this case I suspect it would be bwa + gatk/freebayes + stacks. Other tools (further downstream in the analysis) would also be there, but they are probably less intensive.

So I would say something on the order of the computational cost of a low-coverage exome capture per user (mostly the cost of bwa + freebayes). The number of users is quite difficult to estimate, but let's put it on the order of 100-200. I suspect peak usage will be low (around 10 concurrent users) but disk usage might be high (at least 1 TB per user).
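Putting these numbers next to the figures mentioned earlier in the thread (a 128-core machine, a 1 PB disk budget, and roughly 4 cores per mapping job) gives a quick headroom check. This is a minimal sketch with a hypothetical function name; it assumes one 4-core job per concurrent user and exactly 1 TB per user, both of which are simplifications.

```python
def headroom(machine_cores: int, disk_budget_tb: float,
             peak_users: int, cores_per_job: int,
             total_users: int, tb_per_user: float) -> dict:
    """Cores and disk left over at peak load under the stated assumptions."""
    cores_used = peak_users * cores_per_job
    disk_used_tb = total_users * tb_per_user
    return {
        "cores_free": machine_cores - cores_used,
        "disk_free_tb": disk_budget_tb - disk_used_tb,
    }


if __name__ == "__main__":
    # Upper end of the user estimate: 200 users, 10 of them active at peak.
    print(headroom(128, 1000, 10, 4, 200, 1))
```

A negative value in either field would indicate that the single machine or the disk budget is the bottleneck for that scenario.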

Thanks!

written 15 months ago by tiagoantao • 30
