Question: SGE DRMAA cluster Galaxy configuration
janecao999 wrote, 3 months ago:

I recently installed a local Galaxy instance. I want to distribute jobs either locally (the default, running on the head node) or to an SGE cluster. The job_conf.xml was configured as:

<?xml version="1.0"?>
<!-- A sample job config that explicitly configures job running the way it is configured by default (if there is no explicit config). -->
<job_conf>
    <plugins>
        <plugin id="local" type="runner" load="galaxy.jobs.runners.local:LocalJobRunner" workers="4"/>
        <plugin id="drmaa_default" type="runner" load="galaxy.jobs.runners.drmaa:DRMAAJobRunner" />
    </plugins>
    <destinations default="local">
        <destination id="local" runner="local">
            <param id="local_slots">2</param>
        </destination>
        <destination id="sge_default" runner="drmaa_default">
            <param id="nativeSpecification">-pe shared 4 -l highp -l h_rt=24:00:00</param>
        </destination>
    </destinations>
    <tools>
       <tool id="tophat" destination="sge_default" />
       <tool id="bowtie2" destination="sge_default" />
       <tool id="bowtie_wrapper" destination="sge_default" />
       <tool id="bwa_wrapper" destination="sge_default" />
    </tools>
</job_conf>

Jobs submitted to the default/local destination work well, but jobs sent to sge_default never reach the queuing system.

For example, after tophat was submitted, a job script (galaxy_id.sh) can be found in database/jobs_directory, but I cannot see the job in the SGE queuing system (using the qstat command). There is also no script in the database/pbs directory as there was in the old Galaxy system.

Where should I look to solve this issue, and is there any misconfiguration in my job_conf.xml file?

Thank you for your help!

Jane

modified 3 months ago by Nate Coraor • written 3 months ago by janecao999
Nate Coraor (United States) wrote, 3 months ago:

The pbs directory isn't used any more; the job script written directly to the job's working directory is the only copy. The Galaxy job handler's logs should indicate whether there was an error submitting to SGE.

written 3 months ago by Nate Coraor

Hi Nate,

Where can I define a job handler? In the old version of Galaxy, I defined job handlers in the universe_wsgi.ini file like this:

[server:handler0]
use = egg:Paste#http
port = 8090
host = 127.0.0.1
use_threadpool = True
threadpool_workers = 8

[server:handler1]
use = egg:Paste#http
port = 8091
host = 127.0.0.1
use_threadpool = True
threadpool_workers = 8

In the new Galaxy config file, galaxy.yml, there doesn't seem to be any job handler definition. I used the handler definition from job_conf.xml.sample_basic (<handler id="main">), but I had to remove it because of the PR reported at https://github.com/galaxyproject/galaxy/pull/5981/files. With no job handler defined, I have only a galaxy.log file and no handler.log files. Is there a way to add a job handler definition to either galaxy.yml or job_conf.xml?

Thanks a lot for your advice!

written 3 months ago by janecao999

galaxy.yml was configured for uWSGI all-in-one job handling, so all job logs should go to the galaxy.log file. I checked the log file, and the job associated with tophat logged the following:

galaxy.jobs DEBUG 2018-05-02 19:05:11,320 [p:26594,w:1,m:0] [JobHandlerQueue.monitor_thread] (31) Working directory for job is: /u/home/galaxy/galaxy/ngsgalaxy/galaxy/database/jobs_directory/000/31
galaxy.jobs.handler DEBUG 2018-05-02 19:05:11,338 [p:26594,w:1,m:0] [JobHandlerQueue.monitor_thread] (31) Dispatching to sge runner
galaxy.jobs DEBUG 2018-05-02 19:05:11,387 [p:26594,w:1,m:0] [JobHandlerQueue.monitor_thread] (31) Persisting job destination (destination id: sge_default)
galaxy.jobs.runners DEBUG 2018-05-02 19:05:11,422 [p:26594,w:1,m:0] [JobHandlerQueue.monitor_thread] Job [31] queued (83.650 ms)
galaxy.jobs.handler INFO 2018-05-02 19:05:11,436 [p:26594,w:1,m:0] [JobHandlerQueue.monitor_thread] (31) Job dispatched
galaxy.jobs.command_factory INFO 2018-05-02 19:05:11,750 [p:26594,w:1,m:0] [DRMAARunner.work_thread-0] Built script [/u/home/galaxy/galaxy/ngsgalaxy/galaxy/database/jobs_directory/000/31/tool_script.sh] for tool command [tophat --version > /u/home/galaxy/galaxy/ngsgalaxy/galaxy/database/tmp/GALAXY_VERSION_STRING_31 2>&1; python /u/home/galaxy/galaxy/ngsgalaxy/galaxy/tools/ngs_rna/tophat_wrapper.py --num-threads="4" --junctions-output=/u/home/galaxy/galaxy/ngsgalaxy/galaxy/database/files/000/dataset_62.dat --hits-output=/u/home/galaxy/galaxy/ngsgalaxy/galaxy/database/files/000/dataset_63.dat --indexes-path="/u/home/galaxy/galaxy/GalaxyData/genomes/bowtie/hg19" --single-paired=single --input1=/u/home/galaxy/galaxy/ngsgalaxy/galaxy/database/files/000/dataset_19.dat --settings=preSet]
galaxy.jobs.runners DEBUG 2018-05-02 19:05:12,475 [p:26594,w:1,m:0] [DRMAARunner.work_thread-0] (31) command is: .... sh -c "exit $return_code"
galaxy.jobs.runners.drmaa DEBUG 2018-05-02 19:05:12,529 [p:26594,w:1,m:0] [DRMAARunner.work_thread-0] (31) submitting file /u/home/galaxy/galaxy/ngsgalaxy/galaxy/database/jobs_directory/000/31/galaxy_31.sh
galaxy.jobs.runners.drmaa DEBUG 2018-05-02 19:05:12,529 [p:26594,w:1,m:0] [DRMAARunner.work_thread-0] (31) native specification is: -q galaxy_pod.q -pe shared 4 -l highp -l h_rt=24:00:00

There is no error associated with the job, but the job never appears in the SGE queue. Are there any tools I can use to trace the process between Galaxy job submission and SGE?

written 3 months ago by janecao999

After the native specification message you should see more, e.g.:

galaxy.jobs.runners.drmaa DEBUG 2018-05-03 08:48:34,950 (19306430) native specification is: --partition=normal --nodes=1 --ntasks=1 --time=36:00:00  
galaxy.jobs.runners.drmaa INFO 2018-05-03 08:48:34,964 (19306430) queued as 13407767  
galaxy.jobs DEBUG 2018-05-03 08:48:34,965 (19306430) Persisting job destination (destination id: slurm_normal)  
galaxy.jobs.runners.drmaa DEBUG 2018-05-03 08:48:36,022 (19306430/13407767) state change: job is queued and active  

If you don't see any log messages after the native specification, this would indicate that the submission process is hanging. Are you able to successfully submit jobs from the Galaxy server on the command line (e.g. with qsub)?
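To test that same submission path outside of Galaxy, a minimal standalone DRMAA script can help. This is a sketch, not Galaxy code: it assumes the drmaa Python package (installed in Galaxy's virtualenv) and a reachable libdrmaa (export DRMAA_LIBRARY_PATH to point at libdrmaa.so if needed); the command and native specification here are illustrative.

```python
# Standalone DRMAA submission check, run as the same user and in the
# same environment as the Galaxy job handler.
import drmaa

with drmaa.Session() as s:
    jt = s.createJobTemplate()
    jt.remoteCommand = '/bin/sleep'   # trivial job to submit
    jt.args = ['30']
    # Illustrative; use the native specification Galaxy logs for your destination.
    jt.nativeSpecification = '-pe shared 4 -l highp -l h_rt=24:00:00'
    print('submitting...')
    job_id = s.runJob(jt)             # if this never returns, DRMAA itself is hanging
    print('queued as', job_id)        # the job should now be visible in qstat
    s.deleteJobTemplate(jt)
```

If this script hangs the same way, the problem is in the DRMAA library or SGE configuration rather than in Galaxy.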

written 3 months ago by Nate Coraor

After the native specification line, I don't see "queued as #id". I can use qsub on the command line to submit jobs without any problems. I also turned on heartbeat logging in galaxy.yml, but the log just gives a bunch of traceback dumps for each main.web thread and doesn't seem to contain any specific error messages. The only thing related to an error is: return f(*(args + (error_buffer, sizeof(error_buffer)))).

Is there a way to trace where the submission process is stuck?

written 3 months ago by janecao999

I would suggest running a separate "webless" handler so that you can isolate the heartbeat to that process.
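A sketch of what that could look like in job_conf.xml, assuming the handlers syntax from the job_conf.xml samples (the handler id here is illustrative; adjust to your setup):

```xml
<!-- inside <job_conf>, alongside <plugins> and <destinations> -->
<handlers>
    <handler id="handler1"/>
</handlers>
```

The handler process is then started separately, passing the same id as the server name so it picks up the jobs assigned to it.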

written 3 months ago by Nate Coraor

When I added a webless handler to the job_conf.xml file (<tool id="tophat" handler="handler1"/>) and ran "./scripts/galaxy-main -c config/galaxy.yml --server-name handler1 --daemonize", it exited with "ImportError: Attempted to use Galaxy in daemon mode, but daemonize is unavailable.".

I then used the uWSGI all-in-one job handling and added main.web.3 as the handler for tophat (<tool id="tophat" handler="main.web.3"/>). After submitting a tophat job, most of the content in heartbeat_main.web.3.log looks like:

Thread 140238722336512, <Thread(DRMAARunner.work_thread-0, started daemon 140238722336512)>:

  File "/u/local/apps/python/2.7.2/lib/python2.7/threading.py", line 525, in __bootstrap
    self.__bootstrap_inner()
  File "/u/local/apps/python/2.7.2/lib/python2.7/threading.py", line 552, in __bootstrap_inner
    self.run()
  File "/u/local/apps/python/2.7.2/lib/python2.7/threading.py", line 505, in run
    self.__target(*self.__args, **self.__kwargs)
  File "lib/galaxy/jobs/runners/__init__.py", line 95, in run_next
    (method, arg) = self.work_queue.get()
  File "/u/local/apps/python/2.7.2/lib/python2.7/Queue.py", line 168, in get
    self.not_empty.wait()
  File "/u/local/apps/python/2.7.2/lib/python2.7/threading.py", line 244, in wait
    waiter.acquire()

Thread 140238960465664, <Heartbeat(Heartbeat Thread, started daemon 140238960465664)>:

  File "/u/local/apps/python/2.7.2/lib/python2.7/threading.py", line 525, in __bootstrap
    self.__bootstrap_inner()
  File "/u/local/apps/python/2.7.2/lib/python2.7/threading.py", line 552, in __bootstrap_inner
    self.run()
  File "lib/galaxy/util/heartbeat.py", line 64, in run
    self.dump()
  File "lib/galaxy/util/heartbeat.py", line 94, in dump
    traceback.print_stack(frame, file=self.file)

Thread 140239547021056, <_MainThread(uWSGIWorker3Core0, started 140239547021056)>:

  File ".venv/bin/uwsgi", line 11, in <module>
    sys.exit(run())

Thread 140237965608704, <Thread(JobHandlerQueue.monitor_thread, started daemon 140237965608704)>:

  File "/u/local/apps/python/2.7.2/lib/python2.7/threading.py", line 525, in __bootstrap
    self.__bootstrap_inner()
  File "/u/local/apps/python/2.7.2/lib/python2.7/threading.py", line 552, in __bootstrap_inner
    self.run()
  File "/u/local/apps/python/2.7.2/lib/python2.7/threading.py", line 505, in run
    self.__target(*self.__args, **self.__kwargs)
  File "lib/galaxy/jobs/handler.py", line 216, in __monitor
    self._monitor_sleep(1)
  File "lib/galaxy/util/monitors.py", line 37, in _monitor_sleep
    self.sleeper.sleep(sleep_amount)
  File "lib/galaxy/util/sleeper.py", line 16, in sleep
    self.condition.wait(seconds)
  File "/u/local/apps/python/2.7.2/lib/python2.7/threading.py", line 263, in wait
    _sleep(delay)

End dump
written 3 months ago by janecao999

Sorry I missed your reply. If you're still having issues, please let me know. This heartbeat log appears to show the threads in an idle state.

To run the daemon handler, you can install daemonize in the virtualenv with ./.venv/bin/pip install daemonize.

written 12 weeks ago by Nate Coraor

Hi Nate,

The issue still exists. Again, Galaxy stops at the self.ds.run_job(**jt) step. We debugged further from there and noticed that the setattr(template, key, kwds[key]) call doesn't work. I also posted the issue on the Galaxy IRC channel.

written 11 weeks ago by janecao999

Ok, at this stage I think it would be best to continue this on IRC, or the galaxy-dev mailing list if I miss your question on IRC.

written 10 weeks ago by Nate Coraor

Hi Nate,

I checked drmaa.py (lib/galaxy/jobs/runners) in Galaxy and noticed that external_runjob_script comes back as None from job_wrapper.get_destination_configuration. Because it is None, the loop that creates external_job_id should be entered, but it seems that self.ds.run_job(**jt) is never executed at all.
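For readers following along, the branch being described can be paraphrased roughly like this (a simplified sketch built only from the names mentioned in this thread, not the actual Galaxy source):

```python
# Simplified paraphrase of the submission step in the drmaa runner.
external_runjob_script = job_wrapper.get_destination_configuration(
    "external_runjob_script", None)
if external_runjob_script is None:
    # Submit through the DRMAA session directly; this is the call
    # that appears to never execute in the case above.
    external_job_id = self.ds.run_job(**jt)
else:
    # Otherwise submission happens via an external script
    # (run-jobs-as-real-user setups).
    ...
```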

Do you know what could cause that? Could it be an issue in the drmaa Python library?

Thank you for your support!

written 3 months ago by janecao999
Powered by Biostar version 16.09