Question: Concurrent job limit ... but where?
christophe.habib wrote (2.7 years ago, from France):

Hello everyone,

I am running into a problem with the number of concurrent jobs on my local Galaxy instance. Currently I can run at most 10 jobs simultaneously, which clearly underuses the capacity of my server. I am using the Paste method with the default configuration (1 server, 1 handler), and my runner is HTCondor.

I looked into simple solutions, thanks to a conversation with pvh_sa on IRC (a rough sketch of these settings follows the list):

  • set "concurrent_jobs" and "registered_user_concurrent_jobs" to 30 in the <limits> block of the job_conf.xml
  • increasing the number of threadpool_workers to 20 in the galaxy.ini
  • increasing the number of htcondor worker from 4 to 10.
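
For reference, here is roughly what those settings look like; treat it as a sketch, since the values are from my setup and the exact attribute names are from memory. In job_conf.xml:

<!-- runner plugin with more workers -->
<plugin id="condor" type="runner" load="galaxy.jobs.runners.condor:CondorJobRunner" workers="10"/>

<!-- per-user concurrency limit -->
<limits>
    <limit type="registered_user_concurrent_jobs">30</limit>
</limits>

and in galaxy.ini:

# number of threads in the Paste worker pool
threadpool_workers = 20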

But none of them worked. So I wonder: what is limiting my instance to 10 concurrent jobs? I was thinking about switching from Paste to uWSGI to add servers and handlers, but since I don't know where the concurrent job limit comes from, I am not sure that would solve my problem. I would like to understand this before adding complexity.

Your help regarding the understanding of this problem would be very much appreciated.

Thank you !

Christophe

modified 2.7 years ago • written 2.7 years ago by christophe.habib

Hi Christophe

Just double checking: is the "job_config_file" setting in the 'galaxy.ini' file pointing to the right 'job_conf.xml' file?
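
i.e., something like this (the path here is just the usual default):

# galaxy.ini
[app:main]
job_config_file = config/job_conf.xml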

Regards, Hans-Rudolf

written 2.7 years ago by Hotz, Hans-Rudolf

This might be helpful if the limit is within HTCondor itself: http://georgi.hristozov.net/2015/08/28/increasing-the-number-of-shared-port-workers-in-htcondor.html
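
If that is the bottleneck, the fix described there is, as far as I remember, a one-line change in the condor configuration (the value here is only an example):

# condor_config.local: raise the shared port daemon's worker limit
# (the default is 50)
SHARED_PORT_MAX_WORKERS = 100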

written 2.7 years ago by Jennifer Hillman Jackson

Hi, I double-checked galaxy.ini, and job_config_file is pointing to the right job_conf.xml, where I set the user concurrent job limit. So the limit isn't there :(

Concerning a limit within HTCondor itself, I do not reach the limit of 50 shared port workers. And I tried from 4 to 10 runner workers with no change in the observed number of concurrent jobs.

So I wondered if this limit was set in the Debian system itself, but it makes no sense to have a limit of 10 jobs when you have 32 slots available ...

Has none of you observed this limit when building an instance from scratch and launching a lot of analyses?

modified 2.7 years ago • written 2.7 years ago by christophe.habib

"So I wondered if this limit was set in the Debian system itself, but it makes no sens to have a limit to 10 jobs when u have 32 slots available ..."

...well, this rings a (non-Galaxy-related) bell: we recently bought a box with 48 CPUs. When we tried running STAR with more than 16 CPUs, it broke. As it turns out, "nofile" (i.e. the limit on the number of files that a single process can have open at a time) was set to 16. I realize your problem is the number of jobs and not the number of open files, but maybe...

And just to double check: what is the output of "ulimit -u"?

written 2.7 years ago by Hotz, Hans-Rudolf

How do you get the nofile information? "ulimit -u" gives: 1033427. But I don't understand what information this command provides.

modified 2.7 years ago • written 2.7 years ago by christophe.habib

run "ulimit -a" to get mor explanations and check the file: "/etc/security/limits.conf" - though I am not sure this is all the same for Debian

written 2.7 years ago by Hotz, Hans-Rudolf

Here is what I obtain from the command "ulimit -a":

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 1033427
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 65536
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1033427
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

The max user processes value is huge, and I see nothing here that could act as a limit for Galaxy or for any unix user.

The limits.conf file is entirely commented out, and it contains no information regarding any default values for users.

Do you know if there is a limit on the number of jobs per handler in an instance?
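
For context, my understanding is that with Paste, extra handlers would be declared roughly like this (the handler name and port are illustrative). In galaxy.ini:

[server:handler0]
use = egg:Paste#http
port = 8090
host = 127.0.0.1

and in job_conf.xml:

<handlers default="handlers">
    <handler id="handler0" tags="handlers"/>
</handlers>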

written 2.7 years ago by christophe.habib

I am sorry, but I have reached the limit of my 'sys-admin knowledge' :(

written 2.7 years ago by Hotz, Hans-Rudolf

I am wondering about your experience with Galaxy.

Did you set up your instance directly with several servers and handlers? That would explain why you never observed this limit, if that is the cause of my problem.

written 2.7 years ago by christophe.habib

No, it is all on one box, with several "LocalJobRunner" workers.

written 2.7 years ago by Hotz, Hans-Rudolf
christophe.habib wrote (2.7 years ago):

Just one note: this limit applies to the whole Galaxy instance. When several users launch analyses at the same time, they have to wait for other jobs to finish before their own jobs start.

written 2.7 years ago by christophe.habib
christophe.habib wrote (2.7 years ago):

So I found what was wrong with my instance. The problem was in the HTCondor configuration in /etc/condor/condor_config.local. I had this:

NUM_SLOTS = 6
NUM_SLOTS_TYPE_1 = 3
SLOT_TYPE_1 = cpus=8
SLOT_TYPE_1_PARTITIONABLE = true
NUM_SLOTS_TYPE_2 = 3
SLOT_TYPE_2 = cpus=2
SLOT_TYPE_2_PARTITIONABLE = true

That declares 3 partitionable slots with 8 CPUs each plus 3 with 2 CPUs each (30 CPUs in total), but HTCondor did not manage to use all the CPUs of each slot. So I tried changing the config to the following, 1 slot with all the available CPUs. With this, it was able to run at most 2 jobs.

NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = cpus=30
SLOT_TYPE_1_PARTITIONABLE = true

So I modified it again, to 30 slots with 1 CPU each:

NUM_SLOTS = 30
NUM_SLOTS_TYPE_1 = 30
SLOT_TYPE_1 = cpus=1
SLOT_TYPE_1_PARTITIONABLE = true

And here the 24 workflows ran concurrently. But this solution is unsatisfying, since multi-CPU analyses would be stuck in the grey (queued) state with this configuration.

Do you know how to set up multi-CPU slots in HTCondor so that all the CPUs in each slot can be used, even by 1-CPU jobs?

modified 2.7 years ago • written 2.7 years ago by christophe.habib

I found the right way to write this in my condor config file:

# a single partitionable slot spanning 100% of the machine's resources;
# HTCondor carves dynamic slots out of it to match each job's request
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = 100%
SLOT_TYPE_1_PARTITIONABLE = true

Here all my workflows run concurrently and all the resources are used. :)
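
For multi-CPU tools this should keep working, as far as I understand, as long as each job requests the CPUs it needs; HTCondor then carves a dynamic slot of that size out of the partitionable slot. In a plain HTCondor submit file that request would look like this (the executable is hypothetical):

# example.sub
universe     = vanilla
executable   = my_tool.sh
request_cpus = 4
queue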

written 2.7 years ago by christophe.habib