Question: Slurm Drmaa configuration for Galaxy

10 weeks ago by pks71500 • 20

Hello, I followed the links below to configure Slurm for Galaxy.

https://biostar.usegalaxy.org/p/19543/

http://gmod.827538.n3.nabble.com/Running-Galaxy-on-a-cluster-with-SLURM-td4051302.html

I can successfully submit a job through slurm-drmaa or python-drmaa, but not from Galaxy. Galaxy only shows "This job is waiting to run."

When I ran run.sh, I saw the following log messages:

galaxy.jobs.manager DEBUG 2018-09-19 12:50:33,518 [p:10604,w:0,m:0] [MainThread] Initializing job handler
galaxy.jobs INFO 2018-09-19 12:50:33,518 [p:10604,w:0,m:0] [MainThread] Handler 'main' will load specified runner plugins: slurm
galaxy.jobs.runners.state_handler_factory DEBUG 2018-09-19 12:50:33,520 [p:10604,w:0,m:0] [MainThread] Loaded 'failure' state handler from module galaxy.jobs.runners.state_handlers.resubmit
I #296c [     0.00]  * logging started at: 2018-09-19 12:50:33.52 Z
t #296c [     0.00] -> fsd_exc_init
t #296c [     0.00] <- fsd_exc_init
t #296c [     0.00] -> drmaa_init(contact=(null))
t #296c [     0.00] -> fsd_drmaa_session_new((null))
t #296c [     0.00] -> fsd_job_set_new()
t #296c [     0.00] <- fsd_job_set_new =0x4b87e30
t #296c [     0.00] -> fsd_conf_read(filename=/etc/slurm_drmaa.conf, must_exist=false, content=(null))
t #296c [     0.00]  * content from file
t #296c [     0.00] <- fsd_conf_read
t #296c [     0.00] -> fsd_conf_read(filename=/root/.slurm_drmaa.conf, must_exist=false, content=(null))
t #296c [     0.00] <- fsd_conf_read
t #296c [     0.00] -> fsd_drmaa_session_apply_configuration
t #296c [     0.00] <- fsd_drmaa_session_apply_configuration
t #296c [     0.00] <- drmaa_init =0

When I launched a job, I saw the following log messages:

galaxy.tools.actions.upload DEBUG 2018-09-19 13:51:06,388 [p:11274,w:1,m:0] [uWSGIWorker1Core1] Checked uploads (621.864 ms)
galaxy.tools.actions.upload_common INFO 2018-09-19 13:51:06,500 [p:11274,w:1,m:0] [uWSGIWorker1Core1] tool upload1 created job id 6
galaxy.tools.actions.upload DEBUG 2018-09-19 13:51:06,633 [p:11274,w:1,m:0] [uWSGIWorker1Core1] Created upload job (244.750 ms)
galaxy.tools.execute DEBUG 2018-09-19 13:51:06,633 [p:11274,w:1,m:0] [uWSGIWorker1Core1] Tool [upload1] created job [6] (867.546 ms)
galaxy.tools.execute DEBUG 2018-09-19 13:51:06,657 [p:11274,w:1,m:0] [uWSGIWorker1Core1] Executed 1 job(s) for tool upload1 request: (907.967 ms)

Here is my job_conf.xml file.

<?xml version="1.0"?>
<!-- A sample job config that explicitly configures job running the way it is configured by default (if there is no explicit config). -->
<job_conf>
<plugins workers="10">
    <plugin id="slurm" type="runner" load="galaxy.jobs.runners.slurm:SlurmJobRunner"/>
        <param id="drmaa_library_path">/usr/local/lib/libdrmaa.so</param>
</plugins>

<handlers default="handlers">
    <handler id="main" tags="handlers">
        <plugin id="slurm"/>
    </handler>
</handlers>

<destinations default="slurm">
    <destination id="slurm" runner="slurm">
        <param id="request_cpus">1</param>
        <param id="embed_metadata_in_job">False</param>
        <param id="nativeSpecification">-p standard </param>
        <env file="/srv/galaxy/.venv/bin/activate" />
    </destination>
</destinations>
</job_conf>

Any help would be appreciated. Thanks.
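
A note on the job_conf.xml above: the <plugin id="slurm" ... /> element is self-closing, so the <param id="drmaa_library_path"> that follows it is parsed as a child of <plugins>, not of the plugin. In Galaxy's sample job configurations, runner parameters are nested inside the <plugin> element, so this parameter is probably ignored here. The logs do not establish whether that is related to the job staying in the waiting state. The sketch below is a minimal, hypothetical check (the name misplaced_params is made up for illustration, and the sample string is an excerpt of the file above) that uses only the Python standard library to list such stray parameters:

```python
import xml.etree.ElementTree as ET


def misplaced_params(job_conf_xml):
    """Return the ids of <param> elements that sit directly under <plugins>.

    Hypothetical helper for illustration; it is not part of Galaxy.
    """
    root = ET.fromstring(job_conf_xml)
    plugins = root.find("plugins")
    if plugins is None:
        return []
    # findall("param") only matches direct children of <plugins>,
    # so params correctly nested inside a <plugin> are not reported.
    return [param.get("id", "") for param in plugins.findall("param")]


# Excerpt of the <plugins> section from the job_conf.xml in the question.
SAMPLE = """<job_conf>
<plugins workers="10">
    <plugin id="slurm" type="runner" load="galaxy.jobs.runners.slurm:SlurmJobRunner"/>
        <param id="drmaa_library_path">/usr/local/lib/libdrmaa.so</param>
</plugins>
</job_conf>"""

if __name__ == "__main__":
    print(misplaced_params(SAMPLE))
```

For the excerpt above this reports drmaa_library_path. Writing the plugin with a separate closing tag and placing the <param> between <plugin ...> and </plugin> makes the list empty.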

slurm drmaa • 149 views

I have updated my job_conf.xml file, and I can now see that Galaxy tries to submit a job. However, the submission fails with "Invalid user id".

<?xml version="1.0"?>
<!-- A sample job config that explicitly configures job running the way it is configured by default (if there is no explicit config). -->
<job_conf>
<plugins workers="10">
        <plugin id="local" type="runner" load="galaxy.jobs.runners.local:LocalJobRunner"/>
    <plugin id="slurm" type="runner" load="galaxy.jobs.runners.slurm:SlurmJobRunner"/>
        <param id="drmaa_library_path">/usr/local/lib/libdrmaa.so</param>
</plugins>

<handlers default="handlers">
    <handler id="main"/>
</handlers>

<destinations default="slurm">
    <destination id="local" runner="local">
    </destination>
    <destination id="slurm" runner="slurm">
        <env file="/srv/galaxy/.venv/bin/activate" />
    </destination>
</destinations>
</job_conf>



d #2dcc [    77.50]  * # Job category (null) : -J galaxy -p standard
t #2dcc [    77.50] -> slurmdrmaa_parse_native
d #2dcc [    77.50]  * # job_name = g11_upload1_gp4r
d #2dcc [    77.50]  * # partition = standard
t #2dcc [    77.50] <- slurmdrmaa_parse_native
E #2dcc [    77.51]  * fsd_exc_new(1001,slurm_submit_batch_job: Invalid user id,1)
t #2dcc [    77.51] -> slurmdrmaa_free_job_desc
t #2dcc [    77.51] <- slurmdrmaa_free_job_desc
t #2dcc [    77.51] <- drmaa_run_job=1: slurm_submit_batch_job: Invalid user id
t #2dcc [    77.51] -> drmaa_delete_job_template(0x56cabb0)
t #2dcc [    77.51] <- drmaa_delete_job_template =0
galaxy.jobs.runners.drmaa WARNING 2018-09-19 14:37:00,595 [p:11698,w:1,m:0] [SlurmRunner.work_thread-0] (11) drmaa.Session.runJob() failed, will retry: code 1: slurm_submit_batch_job: Invalid user id

I use PAM authentication, and the user can submit a job from the terminal. Can anyone help me pass the user ID correctly?

written 10 weeks ago by pks71500 • 20
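
On the "Invalid user id" error: as far as I know, slurm-drmaa submits a job under the user ID of the process that calls it, which here is the Galaxy process itself and not the person logged in to the Galaxy web interface. The startup trace in the question reads /root/.slurm_drmaa.conf, which suggests (this is an assumption, not something the logs state outright) that Galaxy was started as root. A minimal sketch for printing the account a Python process actually runs as; effective_user is a hypothetical helper name, and the script should be run the same way Galaxy is started:

```python
import os
import pwd


def effective_user():
    """Return (uid, login_name) for the current process.

    login_name is None when the uid has no entry in the local passwd database.
    Hypothetical helper for illustration; it is not part of Galaxy.
    """
    uid = os.geteuid()
    try:
        name = pwd.getpwuid(uid).pw_name
    except KeyError:
        name = None
    return uid, name


if __name__ == "__main__":
    uid, name = effective_user()
    print("running as uid=%d name=%s" % (uid, name))
```

If this reports root, or an account that is not known on the Slurm controller, that would be consistent with the error above. Some Slurm sites also reject jobs submitted by root through their configuration.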

Are you running the latest release, 18.05? https://docs.galaxyproject.org/en/master/releases/18.05_announce.html

There was a fix for running jobs as the "real user" in a VM: https://github.com/galaxyproject/galaxy/pull/5881#issue-181255818

The most current doc is now published here: https://docs.galaxyproject.org/en/master/admin/cluster.html#submitting-jobs-as-the-real-user

We can follow up with troubleshooting once you have checked that your config matches what is published. I am fairly certain that a YAML config is needed (instead of the earlier INI format). We may ask you to share it and get the developers involved.

written 10 weeks ago by Jennifer Hillman Jackson ♦ 25k

Thanks, Jennifer.

I followed the link, and Galaxy now submits jobs with the real user ID, but the job fails and DRMAA reports an unknown error.

d #688f [    54.59]  * updating status of job: 2691203
t #688f [    54.59] -> slurmdrmaa_job_update_status({job_id=2691203})
d #688f [    54.59]  * state = 0, state_reason = 1
d #688f [    54.59]  * interpreting as DRMAA_PS_QUEUED_ACTIVE
t #688f [    54.59] <- slurmdrmaa_job_update_status
t #688f [    54.59] -> fsd_job_release(0x7ffb1800e590={job_id=2691203, ref_cnt=1}) [unlock 2691203]
t #688f [    54.59] -> fsd_job_destroy(0x7ffb1800e590={job_id=2691203})
t #688f [    54.59] <- fsd_job_destroy
t #688f [    54.59] <- fsd_job_release
t #688f [    54.59] <- drmaa_job_ps(job_id=2691203) =0: remote_ps=queued_active (0x10)
t #688f [    55.60] -> drmaa_job_ps(job_id=2691203)
t #688f [    55.60] -> fsd_job_set_get(job_id=2691203)
t #688f [    55.60] <- fsd_job_set_get(job_id=2691203) =NULL
I #688f [    55.60]  * job_ps: recreating job object: 2691203
t #688f [    55.60] -> fsd_job_new(2691203)
t #688f [    55.60] <- fsd_job_new=0x7ffb18017f00: ref_cnt=1 [lock 2691203]
d #688f [    55.60]  *  job->last_update_time = 0
d #688f [    55.60]  * updating status of job: 2691203
t #688f [    55.60] -> slurmdrmaa_job_update_status({job_id=2691203})
d #688f [    55.61]  * state = 5, state_reason = 23
d #688f [    55.61]  * interpreting as DRMAA_PS_FAILED
d #688f [    55.61]  * exit_status = 256 -> 1
d #688f [    55.61]  * exit_status = 256, WEXITSTATUS(exit_status) = 1
t #688f [    55.61] <- slurmdrmaa_job_update_status
t #688f [    55.61] -> fsd_job_release(0x7ffb18017f00={job_id=2691203, ref_cnt=1}) [unlock 2691203]
t #688f [    55.61] -> fsd_job_destroy(0x7ffb18017f00={job_id=2691203})
t #688f [    55.61] <- fsd_job_destroy
t #688f [    55.61] <- fsd_job_release
t #688f [    55.61] <- drmaa_job_ps(job_id=2691203) =0: remote_ps=failed (0x40)
galaxy.jobs.runners.drmaa DEBUG 2018-10-24 20:20:18,474 [p:26758,w:1,m:0] [Dummy-5] (24/2691203) state change: job finished, but failed
galaxy.jobs.runners.slurm WARNING 2018-10-24 20:20:18,528 [p:26758,w:1,m:0] [Dummy-5] (24/2691203) Job failed due to unknown reasons, job state in SLURM was: FAILED
galaxy.jobs DEBUG 2018-10-24 20:20:18,594 [p:26758,w:1,m:0] [SlurmRunner.work_thread-1] fail(): Moved /srv/galaxy/database/working_dir/000/24/galaxy_dataset_24.dat to /srv/galaxy/database/files/000/dataset_24.dat
galaxy.tools.error_reports DEBUG 2018-10-24 20:20:18,903 [p:26758,w:1,m:0] [SlurmRunner.work_thread-1] Bug report plugin <galaxy.tools.error_reports.plugins.sentry.SentryPlugin object at 0x7ffb21b20cd0> generated response None
galaxy.model.metadata DEBUG 2018-10-24 20:20:18,911 [p:26758,w:1,m:0] [SlurmRunner.work_thread-1] Cleaning up external metadata files
galaxy.model.metadata DEBUG 2018-10-24 20:20:18,930 [p:26758,w:1,m:0] [SlurmRunner.work_thread-1] Failed to cleanup MetadataTempFile temp files from /srv/galaxy/database/working_dir/000/24/metadata_out_HistoryDatasetAssociation_24_6B1uZf: No JSON object could be decoded
galaxy.jobs.runners DEBUG 2018-10-24 20:20:18,972 [p:26758,w:1,m:0] [SlurmRunner.work_thread-1] (24/2691203) Unable to cleanup /srv/galaxy/database/working_dir/000/24/galaxy_24.sh: [Errno 2] No such file or directory: '/srv/galaxy/database/working_dir/000/24/galaxy_24.sh'

In galaxy.yml (18.05), I added the following lines. I have also created the folders for new_file_path and job_working_directory. I am not sure whether there are any issues with these lines.

  outputs_to_working_directory: True
  real_system_username: username
  drmaa_external_runjob_script: sudo -E .venv/bin/python scripts/drmaa_external_runner.py --assign_all_groups
  new_file_path: database/file_path
  job_working_directory: database/working_dir

I would appreciate any advice you may have.

written 5 weeks ago by pks71500 • 20
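
On the last trace above: slurm-drmaa interprets state = 5 as DRMAA_PS_FAILED, and exit_status = 256 is the raw wait status, which the trace itself decodes to an exit code of 1. That suggests the job script was started on the cluster and exited with 1, so the job's own stdout and stderr files are a reasonable next place to look. A minimal sketch of that decoding, assuming the conventional Linux wait-status layout (exit code in the high byte, terminating signal in the low 7 bits); decode_wait_status is a hypothetical helper, not part of Galaxy or slurm-drmaa:

```python
def decode_wait_status(status):
    """Split a raw wait status into (exit_code, signal_number).

    Assumes the conventional Linux encoding: the exit code is stored in
    bits 8-15 and the terminating signal in the low 7 bits.
    """
    signal_number = status & 0x7F
    exit_code = (status >> 8) & 0xFF
    return exit_code, signal_number


if __name__ == "__main__":
    # 256 is the exit_status value reported in the trace above.
    print(decode_wait_status(256))
```

For 256 this gives an exit code of 1 and no signal, which matches the "exit_status = 256 -> 1" line in the trace.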
