Question: Concurrent job limit ... but where?
christophe.habib wrote (2.7 years ago, from France):

Hello everyone,

I am running into a problem with the number of concurrent jobs on my local Galaxy instance. Currently I can run at most 10 jobs simultaneously, which clearly underuses the capacity of my server. I am using the Paste method with the default configuration (1 server, 1 handler), and my runner is HTCondor.

I looked into simple solutions, thanks to a conversation with pvh_sa on IRC (a rough sketch of these settings follows the list):

  • set "concurrent_jobs" and "registered_user_concurrent_jobs" to 30 in the <limits> block of the job_conf.xml
  • increasing the number of threadpool_workers to 20 in the galaxy.ini
  • increasing the number of htcondor worker from 4 to 10.
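
For reference, here is roughly what those settings look like; treat it as a sketch, since the values are from my setup and the exact attribute names are from memory. In job_conf.xml:

<!-- runner plugin with more workers -->
<plugin id="condor" type="runner" load="galaxy.jobs.runners.condor:CondorJobRunner" workers="10"/>

<!-- per-user concurrency limit -->
<limits>
    <limit type="registered_user_concurrent_jobs">30</limit>
</limits>

and in galaxy.ini:

# number of threads in the Paste worker pool
threadpool_workers = 20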

But none of them worked. So I wonder: what is limiting my instance to 10 concurrent jobs? I was thinking about switching from Paste to uWSGI to add servers and handlers, but since I don't know where the concurrent job limit comes from, I am not sure that would solve my problem. I would like to understand this before adding complexity.

Your help regarding the understanding of this problem would be very much appreciated.

Thank you !

Christophe

modified 2.7 years ago • written 2.7 years ago by christophe.habib

Hi Christophe

Just double checking: is the "job_config_file" setting in the 'galaxy.ini' file pointing to the right 'job_conf.xml' file?
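
i.e., something like this (the path here is just the usual default):

# galaxy.ini
[app:main]
job_config_file = config/job_conf.xml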

Regards, Hans-Rudolf

written 2.7 years ago by Hotz, Hans-Rudolf

This might be helpful if the limit is within HTCondor itself: http://georgi.hristozov.net/2015/08/28/increasing-the-number-of-shared-port-workers-in-htcondor.html
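
If that is the bottleneck, the fix described there is, as far as I remember, a one-line change in the condor configuration (the value here is only an example):

# condor_config.local: raise the shared port daemon's worker limit
# (the default is 50)
SHARED_PORT_MAX_WORKERS = 100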

written 2.7 years ago by Jennifer Hillman Jackson

Hi, I double-checked galaxy.ini, and job_config_file is pointing to the right job_conf.xml, where I set the user concurrent job limit. So the limit isn't there :(

Concerning a limit within HTCondor itself, I do not reach the limit of 50 shared port workers. And I tried from 4 to 10 runner workers with no change in the observed number of concurrent jobs.

So I wondered if this limit was set in the Debian system itself, but it makes no sense to have a limit of 10 jobs when you have 32 slots available ...

Has none of you observed this limit when building an instance from scratch and launching a lot of analyses?

modified 2.7 years ago • written 2.7 years ago by christophe.habib

"So I wondered if this limit was set in the Debian system itself, but it makes no sens to have a limit to 10 jobs when u have 32 slots available ..."

...well, this rings a (non-Galaxy-related) bell: we recently bought a box with 48 CPUs. When we tried running STAR with more than 16 CPUs, it broke. As it turns out, "nofile" (i.e. the limit on the number of files that a single process can have open at a time) was set to 16. I realize your problem is the number of jobs and not the number of open files, but maybe...

And just to double check: what is the output of "ulimit -u"?

written 2.7 years ago by Hotz, Hans-Rudolf

How do you get the nofile information? "ulimit -u" gives: 1033427. But I don't understand what information this command provides.

modified 2.7 years ago • written 2.7 years ago by christophe.habib

run "ulimit -a" to get mor explanations and check the file: "/etc/security/limits.conf" - though I am not sure this is all the same for Debian

written 2.7 years ago by Hotz, Hans-Rudolf

Here is what I obtain from the command "ulimit -a":

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 1033427
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 65536
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1033427
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

The max user processes value is huge, and I see nothing here that could act as a limit for Galaxy or for any unix user.

The limits.conf file is entirely commented out, and it contains no information regarding any default values for users.

Do you know if there is a limit on the number of jobs per handler in an instance?
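
For context, my understanding is that with Paste, extra handlers would be declared roughly like this (the handler name and port are illustrative). In galaxy.ini:

[server:handler0]
use = egg:Paste#http
port = 8090
host = 127.0.0.1

and in job_conf.xml:

<handlers default="handlers">
    <handler id="handler0" tags="handlers"/>
</handlers>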

written 2.7 years ago by christophe.habib

I am sorry, but I have reached the limit of my 'sys-admin knowledge' :(

written 2.7 years ago by Hotz, Hans-Rudolf

I am wondering about your experience with Galaxy.

Did you set up your instance directly with several servers and handlers? That would explain why you never observed this limit, if that is the cause of my problem.

written 2.7 years ago by christophe.habib

No, it is all on one box, with several "LocalJobRunner" workers.

written 2.7 years ago by Hotz, Hans-Rudolf
christophe.habib wrote (2.7 years ago):

Just one note: this limit applies to the whole Galaxy instance. When several users launch analyses at the same time, they have to wait for other jobs to finish before their own jobs start.

written 2.7 years ago by christophe.habib
christophe.habib wrote (2.7 years ago):

So I found what was wrong with my instance. The problem was in the HTCondor configuration in /etc/condor/condor_config.local. I had this:

NUM_SLOTS = 6
NUM_SLOTS_TYPE_1 = 3
SLOT_TYPE_1 = cpus=8
SLOT_TYPE_1_PARTITIONABLE = true
NUM_SLOTS_TYPE_2 = 3
SLOT_TYPE_2 = cpus=2
SLOT_TYPE_2_PARTITIONABLE = true

That declares 3 partitionable slots with 8 CPUs each plus 3 with 2 CPUs each (30 CPUs in total), but HTCondor did not manage to use all the CPUs of each slot. So I tried changing the config to the following, 1 slot with all the available CPUs. With this, it was able to run at most 2 jobs.

NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = cpus=30
SLOT_TYPE_1_PARTITIONABLE = true

So I modified it again, to 30 slots with 1 CPU each:

NUM_SLOTS = 30
NUM_SLOTS_TYPE_1 = 30
SLOT_TYPE_1 = cpus=1
SLOT_TYPE_1_PARTITIONABLE = true

And here the 24 workflows ran concurrently. But this solution is unsatisfying, since multi-CPU analyses would be stuck in the grey (queued) state with this configuration.

Do you know how to set up multi-CPU slots in HTCondor so that all the CPUs in each slot can be used, even by 1-CPU jobs?

modified 2.7 years ago • written 2.7 years ago by christophe.habib

I found the right way to write this in my condor config file:

# a single partitionable slot spanning 100% of the machine's resources;
# HTCondor carves dynamic slots out of it to match each job's request
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = 100%
SLOT_TYPE_1_PARTITIONABLE = true

Here all my workflows run concurrently and all the resources are used. :)
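
For multi-CPU tools this should keep working, as far as I understand, as long as each job requests the CPUs it needs; HTCondor then carves a dynamic slot of that size out of the partitionable slot. In a plain HTCondor submit file that request would look like this (the executable is hypothetical):

# example.sub
universe     = vanilla
executable   = my_tool.sh
request_cpus = 4
queue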

written 2.7 years ago by christophe.habib