Question: Slurm DRMAA configuration for Galaxy
Asked 10 weeks ago by pks71500:

Hello, I have followed the links below to configure Slurm for Galaxy.

https://biostar.usegalaxy.org/p/19543/

http://gmod.827538.n3.nabble.com/Running-Galaxy-on-a-cluster-with-SLURM-td4051302.html

I can successfully submit a job through slurm-drmaa or python-drmaa, but not from Galaxy. Galaxy only shows "This job is waiting to run."
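
For reference, a minimal python-drmaa test of the kind that works for me looks roughly like this (just a sketch; the partition and library path are the ones from my setup below):

    # export DRMAA_LIBRARY_PATH=/usr/local/lib/libdrmaa.so before running
    import drmaa

    s = drmaa.Session()
    s.initialize()
    jt = s.createJobTemplate()
    jt.remoteCommand = '/bin/sleep'         # trivial test command
    jt.args = ['30']
    jt.nativeSpecification = '-p standard'  # same partition as in job_conf.xml
    print('submitted job', s.runJob(jt))
    s.deleteJobTemplate(jt)
    s.exit()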

When I ran run.sh, I saw the following log messages:

galaxy.jobs.manager DEBUG 2018-09-19 12:50:33,518 [p:10604,w:0,m:0] [MainThread] Initializing job handler
galaxy.jobs INFO 2018-09-19 12:50:33,518 [p:10604,w:0,m:0] [MainThread] Handler 'main' will load specified runner plugins: slurm
galaxy.jobs.runners.state_handler_factory DEBUG 2018-09-19 12:50:33,520 [p:10604,w:0,m:0] [MainThread] Loaded 'failure' state handler from module galaxy.jobs.runners.state_handlers.resubmit
I #296c [     0.00]  * logging started at: 2018-09-19 12:50:33.52 Z
t #296c [     0.00] -> fsd_exc_init
t #296c [     0.00] <- fsd_exc_init
t #296c [     0.00] -> drmaa_init(contact=(null))
t #296c [     0.00] -> fsd_drmaa_session_new((null))
t #296c [     0.00] -> fsd_job_set_new()
t #296c [     0.00] <- fsd_job_set_new =0x4b87e30
t #296c [     0.00] -> fsd_conf_read(filename=/etc/slurm_drmaa.conf, must_exist=false, content=(null))
t #296c [     0.00]  * content from file
t #296c [     0.00] <- fsd_conf_read
t #296c [     0.00] -> fsd_conf_read(filename=/root/.slurm_drmaa.conf, must_exist=false, content=(null))
t #296c [     0.00] <- fsd_conf_read
t #296c [     0.00] -> fsd_drmaa_session_apply_configuration
t #296c [     0.00] <- fsd_drmaa_session_apply_configuration
t #296c [     0.00] <- drmaa_init =0

When I launched a job, I saw the following log messages:

galaxy.tools.actions.upload DEBUG 2018-09-19 13:51:06,388 [p:11274,w:1,m:0] [uWSGIWorker1Core1] Checked uploads (621.864 ms)
galaxy.tools.actions.upload_common INFO 2018-09-19 13:51:06,500 [p:11274,w:1,m:0] [uWSGIWorker1Core1] tool upload1 created job id 6
galaxy.tools.actions.upload DEBUG 2018-09-19 13:51:06,633 [p:11274,w:1,m:0] [uWSGIWorker1Core1] Created upload job (244.750 ms)
galaxy.tools.execute DEBUG 2018-09-19 13:51:06,633 [p:11274,w:1,m:0] [uWSGIWorker1Core1] Tool [upload1] created job [6] (867.546 ms)
galaxy.tools.execute DEBUG 2018-09-19 13:51:06,657 [p:11274,w:1,m:0] [uWSGIWorker1Core1] Executed 1 job(s) for tool upload1 request: (907.967 ms)

Here is my job_conf.xml file:

<?xml version="1.0"?>
<!-- A sample job config that explicitly configures job running the way it is configured by default (if there is no explicit config). -->
<job_conf>
    <plugins workers="10">
        <plugin id="slurm" type="runner" load="galaxy.jobs.runners.slurm:SlurmJobRunner"/>
        <param id="drmaa_library_path">/usr/local/lib/libdrmaa.so</param>
    </plugins>

    <handlers default="handlers">
        <handler id="main" tags="handlers">
            <plugin id="slurm"/>
        </handler>
    </handlers>

    <destinations default="slurm">
        <destination id="slurm" runner="slurm">
            <param id="request_cpus">1</param>
            <param id="embed_metadata_in_job">False</param>
            <param id="nativeSpecification">-p standard</param>
            <env file="/srv/galaxy/.venv/bin/activate" />
        </destination>
    </destinations>
</job_conf>

Any help would be appreciated. Thanks.

Tags: slurm, drmaa

I have updated the job_conf.xml file, and I can now see that Galaxy tries to submit a job, but it fails with "Invalid user id". Here is the updated config:

<?xml version="1.0"?>
<!-- A sample job config that explicitly configures job running the way it is configured by default (if there is no explicit config). -->
<job_conf>
    <plugins workers="10">
        <plugin id="local" type="runner" load="galaxy.jobs.runners.local:LocalJobRunner"/>
        <plugin id="slurm" type="runner" load="galaxy.jobs.runners.slurm:SlurmJobRunner"/>
        <param id="drmaa_library_path">/usr/local/lib/libdrmaa.so</param>
    </plugins>

    <handlers default="handlers">
        <handler id="main"/>
    </handlers>

    <destinations default="slurm">
        <destination id="local" runner="local"/>
        <destination id="slurm" runner="slurm">
            <env file="/srv/galaxy/.venv/bin/activate" />
        </destination>
    </destinations>
</job_conf>

The slurm-drmaa trace from the failed submission shows:

d #2dcc [    77.50]  * # Job category (null) : -J galaxy -p standard
t #2dcc [    77.50] -> slurmdrmaa_parse_native
d #2dcc [    77.50]  * # job_name = g11_upload1_gp4r
d #2dcc [    77.50]  * # partition = standard
t #2dcc [    77.50] <- slurmdrmaa_parse_native
E #2dcc [    77.51]  * fsd_exc_new(1001,slurm_submit_batch_job: Invalid user id,1)
t #2dcc [    77.51] -> slurmdrmaa_free_job_desc
t #2dcc [    77.51] <- slurmdrmaa_free_job_desc
t #2dcc [    77.51] <- drmaa_run_job=1: slurm_submit_batch_job: Invalid user id
t #2dcc [    77.51] -> drmaa_delete_job_template(0x56cabb0)
t #2dcc [    77.51] <- drmaa_delete_job_template =0
galaxy.jobs.runners.drmaa WARNING 2018-09-19 14:37:00,595 [p:11698,w:1,m:0] [SlurmRunner.work_thread-0] (11) drmaa.Session.runJob() failed, will retry: code 1: slurm_submit_batch_job: Invalid user id

I use PAM authentication, and the user can submit jobs from a terminal. Can anyone help me pass the user id correctly?
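
As a sanity check (just a sketch), something like this, run as the account that starts Galaxy, confirms whether that uid resolves to a valid user on the submit host:

    import os, pwd

    uid = os.getuid()
    print('Galaxy process uid:', uid)
    try:
        # Slurm reports "Invalid user id" when it cannot validate the submitting uid
        print('passwd entry:', pwd.getpwuid(uid))
    except KeyError:
        print('uid not found in the passwd database on this host')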

written 10 weeks ago by pks71500

Are you running the latest release 18.05? https://docs.galaxyproject.org/en/master/releases/18.05_announce.html

There was a fix for running jobs as the "real user" in a VM: https://github.com/galaxyproject/galaxy/pull/5881#issue-181255818

The most current doc is now published here: https://docs.galaxyproject.org/en/master/admin/cluster.html#submitting-jobs-as-the-real-user

We can follow up with troubleshooting after you check that your config matches what is published. I am fairly certain that a YAML config is needed (instead of the prior ini). We may ask you to share that and get the developers involved.

written 10 weeks ago by Jennifer Hillman Jackson

Thanks, Jennifer.

I have followed the link, and now Galaxy submits with the real user id, but DRMAA fails with an unknown error:

d #688f [    54.59]  * updating status of job: 2691203
t #688f [    54.59] -> slurmdrmaa_job_update_status({job_id=2691203})
d #688f [    54.59]  * state = 0, state_reason = 1
d #688f [    54.59]  * interpreting as DRMAA_PS_QUEUED_ACTIVE
t #688f [    54.59] <- slurmdrmaa_job_update_status
t #688f [    54.59] -> fsd_job_release(0x7ffb1800e590={job_id=2691203, ref_cnt=1}) [unlock 2691203]
t #688f [    54.59] -> fsd_job_destroy(0x7ffb1800e590={job_id=2691203})
t #688f [    54.59] <- fsd_job_destroy
t #688f [    54.59] <- fsd_job_release
t #688f [    54.59] <- drmaa_job_ps(job_id=2691203) =0: remote_ps=queued_active (0x10)
t #688f [    55.60] -> drmaa_job_ps(job_id=2691203)
t #688f [    55.60] -> fsd_job_set_get(job_id=2691203)
t #688f [    55.60] <- fsd_job_set_get(job_id=2691203) =NULL
I #688f [    55.60]  * job_ps: recreating job object: 2691203
t #688f [    55.60] -> fsd_job_new(2691203)
t #688f [    55.60] <- fsd_job_new=0x7ffb18017f00: ref_cnt=1 [lock 2691203]
d #688f [    55.60]  *  job->last_update_time = 0
d #688f [    55.60]  * updating status of job: 2691203
t #688f [    55.60] -> slurmdrmaa_job_update_status({job_id=2691203})
d #688f [    55.61]  * state = 5, state_reason = 23
d #688f [    55.61]  * interpreting as DRMAA_PS_FAILED
d #688f [    55.61]  * exit_status = 256 -> 1
d #688f [    55.61]  * exit_status = 256, WEXITSTATUS(exit_status) = 1
t #688f [    55.61] <- slurmdrmaa_job_update_status
t #688f [    55.61] -> fsd_job_release(0x7ffb18017f00={job_id=2691203, ref_cnt=1}) [unlock 2691203]
t #688f [    55.61] -> fsd_job_destroy(0x7ffb18017f00={job_id=2691203})
t #688f [    55.61] <- fsd_job_destroy
t #688f [    55.61] <- fsd_job_release
t #688f [    55.61] <- drmaa_job_ps(job_id=2691203) =0: remote_ps=failed (0x40)
galaxy.jobs.runners.drmaa DEBUG 2018-10-24 20:20:18,474 [p:26758,w:1,m:0] [Dummy-5] (24/2691203) state change: job finished, but failed
galaxy.jobs.runners.slurm WARNING 2018-10-24 20:20:18,528 [p:26758,w:1,m:0] [Dummy-5] (24/2691203) Job failed due to unknown reasons, job state in SLURM was: FAILED
galaxy.jobs DEBUG 2018-10-24 20:20:18,594 [p:26758,w:1,m:0] [SlurmRunner.work_thread-1] fail(): Moved /srv/galaxy/database/working_dir/000/24/galaxy_dataset_24.dat to /srv/galaxy/database/files/000/dataset_24.dat
galaxy.tools.error_reports DEBUG 2018-10-24 20:20:18,903 [p:26758,w:1,m:0] [SlurmRunner.work_thread-1] Bug report plugin <galaxy.tools.error_reports.plugins.sentry.SentryPlugin object at 0x7ffb21b20cd0> generated response None
galaxy.model.metadata DEBUG 2018-10-24 20:20:18,911 [p:26758,w:1,m:0] [SlurmRunner.work_thread-1] Cleaning up external metadata files
galaxy.model.metadata DEBUG 2018-10-24 20:20:18,930 [p:26758,w:1,m:0] [SlurmRunner.work_thread-1] Failed to cleanup MetadataTempFile temp files from /srv/galaxy/database/working_dir/000/24/metadata_out_HistoryDatasetAssociation_24_6B1uZf: No JSON object could be decoded
galaxy.jobs.runners DEBUG 2018-10-24 20:20:18,972 [p:26758,w:1,m:0] [SlurmRunner.work_thread-1] (24/2691203) Unable to cleanup /srv/galaxy/database/working_dir/000/24/galaxy_24.sh: [Errno 2] No such file or directory: '/srv/galaxy/database/working_dir/000/24/galaxy_24.sh'
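
If I read the trace correctly, exit_status = 256 is the raw wait()-style status word, which decodes to the job script exiting with code 1 rather than being killed by a signal:

    import os

    raw = 256                    # raw status from the slurmdrmaa trace above
    print(os.WIFEXITED(raw))     # True: the script exited normally
    print(os.WEXITSTATUS(raw))   # 1: its exit code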

In galaxy.yml (18.05), I added the following lines. I have created the folders for new_file_path and job_working_directory, but I am not sure whether there are any issues with these lines:

  outputs_to_working_directory: True
  real_system_username: username
  drmaa_external_runjob_script: sudo -E .venv/bin/python scripts/drmaa_external_runner.py --assign_all_groups
  new_file_path: database/file_path
  job_working_directory: database/working_dir
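
Since the job now runs as the real user but exits with code 1, one thing I still want to verify (a sketch; the path and username are the placeholders from my config above) is whether the real user can actually use the job working directory:

    import os, pwd, stat

    workdir = 'database/working_dir'  # job_working_directory from galaxy.yml
    real_user = 'username'            # real_system_username placeholder
    uid = pwd.getpwnam(real_user).pw_uid
    st = os.stat(workdir)
    print('owner uid:', st.st_uid, 'mode:', oct(stat.S_IMODE(st.st_mode)))
    # the real user needs read/write/execute here for the job script to run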

I would appreciate any advice you may have.

written 5 weeks ago by pks71500