Question: SGE DRMAA cluster Galaxy configuration
janecao999 wrote, 3 months ago:

I recently installed a local Galaxy instance. I want to distribute jobs either locally (the default, running on the head node) or to an SGE cluster. The job_conf.xml was configured as:

<?xml version="1.0"?>
<!-- A sample job config that explicitly configures job running the way it is configured by default (if there is no explicit config). -->
<job_conf>
    <plugins>
        <plugin id="local" type="runner" load="galaxy.jobs.runners.local:LocalJobRunner" workers="4"/>
        <plugin id="drmaa_default" type="runner" load="galaxy.jobs.runners.drmaa:DRMAAJobRunner" />
    </plugins>
    <destinations default="local">
        <destination id="local" runner="local">
            <param id="local_slots">2</param>
        </destination>
        <destination id="sge_default" runner="drmaa_default">
            <param id="nativeSpecification">-pe shared 4 -l highp -l h_rt=24:00:00</param>
        </destination>
    </destinations>
    <tools>
       <tool id="tophat" destination="sge_default" />
       <tool id="bowtie2" destination="sge_default" />
       <tool id="bowtie_wrapper" destination="sge_default" />
       <tool id="bwa_wrapper" destination="sge_default" />
    </tools>
</job_conf>

Jobs submitted to the default/local destination work well, but jobs sent to sge_default never reach the queuing system.

For example, after tophat was submitted, a job script (galaxy_id.sh) can be found in database/jobs_directory, but I cannot see the job in the SGE queuing system (using the qstat command). There is also no script in the database/pbs directory as there was in the old Galaxy system.

Where should I look to solve this issue, and is there any misconfiguration in my job_conf.xml file?

Thank you for your help!

Jane

modified 3 months ago by Nate Coraor • written 3 months ago by janecao999
Nate Coraor (United States) wrote, 3 months ago:

The pbs directory isn't used any more; the job script written directly to the job's working directory is the only copy. The Galaxy job handler's logs should indicate whether there was an error submitting to SGE.

written 3 months ago by Nate Coraor

Hi Nate,

Where can I define a job handler? In the old version of Galaxy, I defined job handlers in the universe_wsgi.ini file like this:

[server:handler0]
use = egg:Paste#http
port = 8090
host = 127.0.0.1
use_threadpool = True
threadpool_workers = 8

[server:handler1]
use = egg:Paste#http
port = 8091
host = 127.0.0.1
use_threadpool = True
threadpool_workers = 8

In the new Galaxy config file, galaxy.yml, there doesn't seem to be any job handler definition. I used the handler definition from job_conf.xml.sample_basic (<handler id="main">), but I had to remove it because of the PR reported at https://github.com/galaxyproject/galaxy/pull/5981/files. With no job handler defined, I have only a galaxy.log file and no handler.log files. Is there a way to add a job handler definition to either galaxy.yml or job_conf.xml?

Thanks a lot for your advice!

written 3 months ago by janecao999

galaxy.yml was configured for uWSGI all-in-one job handling, so all job logs should go to the galaxy.log file. I checked the log file, and the job associated with tophat logged the following:

galaxy.jobs DEBUG 2018-05-02 19:05:11,320 [p:26594,w:1,m:0] [JobHandlerQueue.monitor_thread] (31) Working directory for job is: /u/home/galaxy/galaxy/ngsgalaxy/galaxy/database/jobs_directory/000/31
galaxy.jobs.handler DEBUG 2018-05-02 19:05:11,338 [p:26594,w:1,m:0] [JobHandlerQueue.monitor_thread] (31) Dispatching to sge runner
galaxy.jobs DEBUG 2018-05-02 19:05:11,387 [p:26594,w:1,m:0] [JobHandlerQueue.monitor_thread] (31) Persisting job destination (destination id: sge_default)
galaxy.jobs.runners DEBUG 2018-05-02 19:05:11,422 [p:26594,w:1,m:0] [JobHandlerQueue.monitor_thread] Job [31] queued (83.650 ms)
galaxy.jobs.handler INFO 2018-05-02 19:05:11,436 [p:26594,w:1,m:0] [JobHandlerQueue.monitor_thread] (31) Job dispatched
galaxy.jobs.command_factory INFO 2018-05-02 19:05:11,750 [p:26594,w:1,m:0] [DRMAARunner.work_thread-0] Built script [/u/home/galaxy/galaxy/ngsgalaxy/galaxy/database/jobs_directory/000/31/tool_script.sh] for tool command [tophat --version > /u/home/galaxy/galaxy/ngsgalaxy/galaxy/database/tmp/GALAXY_VERSION_STRING_31 2>&1; python /u/home/galaxy/galaxy/ngsgalaxy/galaxy/tools/ngs_rna/tophat_wrapper.py --num-threads="4" --junctions-output=/u/home/galaxy/galaxy/ngsgalaxy/galaxy/database/files/000/dataset_62.dat --hits-output=/u/home/galaxy/galaxy/ngsgalaxy/galaxy/database/files/000/dataset_63.dat --indexes-path="/u/home/galaxy/galaxy/GalaxyData/genomes/bowtie/hg19" --single-paired=single --input1=/u/home/galaxy/galaxy/ngsgalaxy/galaxy/database/files/000/dataset_19.dat --settings=preSet]
galaxy.jobs.runners DEBUG 2018-05-02 19:05:12,475 [p:26594,w:1,m:0] [DRMAARunner.work_thread-0] (31) command is: .... sh -c "exit $return_code"
galaxy.jobs.runners.drmaa DEBUG 2018-05-02 19:05:12,529 [p:26594,w:1,m:0] [DRMAARunner.work_thread-0] (31) submitting file /u/home/galaxy/galaxy/ngsgalaxy/galaxy/database/jobs_directory/000/31/galaxy_31.sh
galaxy.jobs.runners.drmaa DEBUG 2018-05-02 19:05:12,529 [p:26594,w:1,m:0] [DRMAARunner.work_thread-0] (31) native specification is: -q galaxy_pod.q -pe shared 4 -l highp -l h_rt=24:00:00

There is no error associated with the job, but the job never appears in the SGE queue. Are there any tools I can use to trace the process between Galaxy job submission and SGE?

written 3 months ago by janecao999

After the native specification message you should see more, e.g.:

galaxy.jobs.runners.drmaa DEBUG 2018-05-03 08:48:34,950 (19306430) native specification is: --partition=normal --nodes=1 --ntasks=1 --time=36:00:00  
galaxy.jobs.runners.drmaa INFO 2018-05-03 08:48:34,964 (19306430) queued as 13407767  
galaxy.jobs DEBUG 2018-05-03 08:48:34,965 (19306430) Persisting job destination (destination id: slurm_normal)  
galaxy.jobs.runners.drmaa DEBUG 2018-05-03 08:48:36,022 (19306430/13407767) state change: job is queued and active  

If you don't see any log messages after the native specification, this would indicate that the submission process is hanging. Are you able to successfully submit jobs from the Galaxy server on the command line (e.g. with qsub)?
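To test that same submission path outside of Galaxy, a minimal standalone DRMAA script can help. This is a sketch, not Galaxy code: it assumes the drmaa Python package (installed in Galaxy's virtualenv) and a reachable libdrmaa (export DRMAA_LIBRARY_PATH to point at libdrmaa.so if needed); the command and native specification here are illustrative.

```python
# Standalone DRMAA submission check, run as the same user and in the
# same environment as the Galaxy job handler.
import drmaa

with drmaa.Session() as s:
    jt = s.createJobTemplate()
    jt.remoteCommand = '/bin/sleep'   # trivial job to submit
    jt.args = ['30']
    # Illustrative; use the native specification Galaxy logs for your destination.
    jt.nativeSpecification = '-pe shared 4 -l highp -l h_rt=24:00:00'
    print('submitting...')
    job_id = s.runJob(jt)             # if this never returns, DRMAA itself is hanging
    print('queued as', job_id)        # the job should now be visible in qstat
    s.deleteJobTemplate(jt)
```

If this script hangs the same way, the problem is in the DRMAA library or SGE configuration rather than in Galaxy.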

written 3 months ago by Nate Coraor

After the native specification line, I don't see "queued as #id". I can use qsub on the command line to submit jobs without any problems. I also turned on heartbeat logging in galaxy.yml, but the log just gives a bunch of traceback dumps for each main.web thread and doesn't seem to contain any specific error messages. The only thing related to an error is: return f(*(args + (error_buffer, sizeof(error_buffer)))).

Is there a way to trace where the submission process is stuck?

written 3 months ago by janecao999

I would suggest running a separate "webless" handler so that you can isolate the heartbeat to that process.
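A sketch of what that could look like in job_conf.xml, assuming the handlers syntax from the job_conf.xml samples (the handler id here is illustrative; adjust to your setup):

```xml
<!-- inside <job_conf>, alongside <plugins> and <destinations> -->
<handlers>
    <handler id="handler1"/>
</handlers>
```

The handler process is then started separately, passing the same id as the server name so it picks up the jobs assigned to it.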

written 3 months ago by Nate Coraor

When I added a webless handler to the job_conf.xml file (<tool id="tophat" handler="handler1"/>) and ran "./scripts/galaxy-main -c config/galaxy.yml --server-name handler1 --daemonize", it exited with "ImportError: Attempted to use Galaxy in daemon mode, but daemonize is unavailable.".

I then used the uWSGI all-in-one job handling and added main.web.3 as the handler for tophat (<tool id="tophat" handler="main.web.3"/>). After submitting a tophat job, most of the content in heartbeat_main.web.3.log looks like:

Thread 140238722336512, <Thread(DRMAARunner.work_thread-0, started daemon 140238722336512)>:

  File "/u/local/apps/python/2.7.2/lib/python2.7/threading.py", line 525, in __bootstrap
    self.__bootstrap_inner()
  File "/u/local/apps/python/2.7.2/lib/python2.7/threading.py", line 552, in __bootstrap_inner
    self.run()
  File "/u/local/apps/python/2.7.2/lib/python2.7/threading.py", line 505, in run
    self.__target(*self.__args, **self.__kwargs)
  File "lib/galaxy/jobs/runners/__init__.py", line 95, in run_next
    (method, arg) = self.work_queue.get()
  File "/u/local/apps/python/2.7.2/lib/python2.7/Queue.py", line 168, in get
    self.not_empty.wait()
  File "/u/local/apps/python/2.7.2/lib/python2.7/threading.py", line 244, in wait
    waiter.acquire()

Thread 140238960465664, <Heartbeat(Heartbeat Thread, started daemon 140238960465664)>:

  File "/u/local/apps/python/2.7.2/lib/python2.7/threading.py", line 525, in __bootstrap
    self.__bootstrap_inner()
  File "/u/local/apps/python/2.7.2/lib/python2.7/threading.py", line 552, in __bootstrap_inner
    self.run()
  File "lib/galaxy/util/heartbeat.py", line 64, in run
    self.dump()
  File "lib/galaxy/util/heartbeat.py", line 94, in dump
    traceback.print_stack(frame, file=self.file)

Thread 140239547021056, <_MainThread(uWSGIWorker3Core0, started 140239547021056)>:

  File ".venv/bin/uwsgi", line 11, in <module>
    sys.exit(run())

Thread 140237965608704, <Thread(JobHandlerQueue.monitor_thread, started daemon 140237965608704)>:

  File "/u/local/apps/python/2.7.2/lib/python2.7/threading.py", line 525, in __bootstrap
    self.__bootstrap_inner()
  File "/u/local/apps/python/2.7.2/lib/python2.7/threading.py", line 552, in __bootstrap_inner
    self.run()
  File "/u/local/apps/python/2.7.2/lib/python2.7/threading.py", line 505, in run
    self.__target(*self.__args, **self.__kwargs)
  File "lib/galaxy/jobs/handler.py", line 216, in __monitor
    self._monitor_sleep(1)
  File "lib/galaxy/util/monitors.py", line 37, in _monitor_sleep
    self.sleeper.sleep(sleep_amount)
  File "lib/galaxy/util/sleeper.py", line 16, in sleep
    self.condition.wait(seconds)
  File "/u/local/apps/python/2.7.2/lib/python2.7/threading.py", line 263, in wait
    _sleep(delay)

End dump
written 3 months ago by janecao999

Sorry I missed your reply. If you're still having issues, please let me know. This heartbeat log appears to show the threads in an idle state.

To run the daemon handler, you can install daemonize in the virtualenv with ./.venv/bin/pip install daemonize.

written 12 weeks ago by Nate Coraor

Hi Nate,

The issue still exists. Again, Galaxy stops at the self.ds.run_job(**jt) step. We debugged further from there and noticed that the setattr(template, key, kwds[key]) call doesn't work. I also posted the issue on the Galaxy IRC channel.

written 11 weeks ago by janecao999

Ok, at this stage I think it would be best to continue this on IRC, or the galaxy-dev mailing list if I miss your question on IRC.

written 10 weeks ago by Nate Coraor

Hi Nate,

I checked drmaa.py (lib/galaxy/jobs/runners) in Galaxy and noticed that external_runjob_script comes back as None from job_wrapper.get_destination_configuration. Because it is None, the loop that creates external_job_id should be entered, but it seems that self.ds.run_job(**jt) is never executed at all.
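For readers following along, the branch being described can be paraphrased roughly like this (a simplified sketch built only from the names mentioned in this thread, not the actual Galaxy source):

```python
# Simplified paraphrase of the submission step in the drmaa runner.
external_runjob_script = job_wrapper.get_destination_configuration(
    "external_runjob_script", None)
if external_runjob_script is None:
    # Submit through the DRMAA session directly; this is the call
    # that appears to never execute in the case above.
    external_job_id = self.ds.run_job(**jt)
else:
    # Otherwise submission happens via an external script
    # (run-jobs-as-real-user setups).
    ...
```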

Do you know what could cause that? Could it be an issue in the drmaa Python library?

Thank you for your support!

written 3 months ago by janecao999
Powered by Biostar version 16.09