Question: Galaxy - SGE, job terminated but failed
Asked 3.2 years ago by FT (France):

Hi everyone,

I'm back with another problem: I am trying to connect Galaxy to our cluster, using this job_conf.xml:

 


<?xml version="1.0"?>
<job_conf>
        <plugins workers="5">
                <plugin id="sge" type="runner" load="galaxy.jobs.runners.drmaa:DRMAAJobRunner">
                        <param id="drmaa_library_path">/opt/sge/lib/linux-x64/libdrmaa.so</param>
                </plugin>
        </plugins>
        <handlers>
                <handler id="main"/>
        </handlers>
        <destinations>
                <destination id="sge_runner" runner="sge"/>
        </destinations>
</job_conf>


 

But the job fails, with this output in the log:

 


[...]
galaxy.util.object_wrapper WARNING 2015-09-08 17:59:37,249 Unable to create dynamic subclass for <type 'instance'>, None: metaclass conflict: the metaclass of a derived class must be a (non-strict) subclass of the metaclasses of all its bases
galaxy.tools.actions INFO 2015-09-08 17:59:37,369 Handled output (120.048 ms)
galaxy.tools.actions INFO 2015-09-08 17:59:37,521 Verified access to datasets (0.208 ms)
galaxy.tools.execute DEBUG 2015-09-08 17:59:37,587 Tool [myTool_id] created job [25] (378.669 ms)
10.80.6.195 - - [08/Sep/2015:17:59:37 +0200] "POST /api/tools HTTP/1.1" 200 - "http://***:***/tool_runner?tool_id=myTool_id" "Mozilla/5.0 (Windows NT 6.1; rv:39.0) Gecko/20100101 Firefox/39.0"
10.80.6.195 - - [08/Sep/2015:17:59:37 +0200] "GET /api/histories/ebfb8f50c6abde6d/contents HTTP/1.1" 200 - "http://***:***/" "Mozilla/5.0 (Windows NT 6.1; rv:39.0) Gecko/20100101 Firefox/39.0"
galaxy.jobs DEBUG 2015-09-08 17:59:38,603 (25) Working directory for job is: /galaxy/data/files/000/25
galaxy.jobs.handler DEBUG 2015-09-08 17:59:38,612 (25) Dispatching to sge runner
galaxy.jobs DEBUG 2015-09-08 17:59:38,890 (25) Persisting job destination (destination id: sge_runner)
galaxy.jobs.runners DEBUG 2015-09-08 17:59:38,898 Job [25] queued (285.758 ms)
galaxy.jobs.handler INFO 2015-09-08 17:59:38,927 (25) Job dispatched
galaxy.jobs.runners DEBUG 2015-09-08 17:59:40,041 (25) command is: bash /***/TOOLS/tools/myTool.test/myTool.sh /***/TOOLS/tools-tests/*** /galaxy/data/files/000/dataset_25.dat; return_code=$?; python "/galaxy/data/files/000/25/set_metadata_dKM1uc.py" "/galaxy/data/tmp/tmpxtCEBN" "/galaxy/data/files/000/25/galaxy.json" "/galaxy/data/files/000/25/metadata_in_HistoryDatasetAssociation_25_GDdz4S,/galaxy/data/files/000/25/metadata_kwds_HistoryDatasetAssociation_25_KRTJ4l,/galaxy/data/files/000/25/metadata_out_HistoryDatasetAssociation_25_b_Y61R,/galaxy/data/files/000/25/metadata_results_HistoryDatasetAssociation_25_haQsf0,/galaxy/data/files/000/dataset_25.dat,/galaxy/data/files/000/25/metadata_override_HistoryDatasetAssociation_25_PKn7Zb" 5242880; sh -c "exit $return_code"
galaxy.jobs.runners.drmaa DEBUG 2015-09-08 17:59:40,137 (25) submitting file /galaxy/data/files/000/25/galaxy_25.sh
galaxy.jobs.runners.drmaa INFO 2015-09-08 17:59:40,151 (25) queued as 4216637
galaxy.jobs DEBUG 2015-09-08 17:59:40,230 (25) Persisting job destination (destination id: sge_runner)
galaxy.jobs.runners.drmaa DEBUG 2015-09-08 17:59:41,036 (25/4216637) state change: job is queued and active
10.80.6.195 - - [08/Sep/2015:17:59:41 +0200] "GET /api/histories/ebfb8f50c6abde6d/contents HTTP/1.1" 200 - "http://***:**/" "Mozilla/5.0 (Windows NT 6.1; rv:39.0) Gecko/20100101 Firefox/39.0"
10.80.6.195 - - [08/Sep/2015:17:59:45 +0200] "GET /api/histories/ebfb8f50c6abde6d/contents HTTP/1.1" 200 - "http://***:**/" "Mozilla/5.0 (Windows NT 6.1; rv:39.0) Gecko/20100101 Firefox/39.0"
galaxy.jobs.runners.drmaa DEBUG 2015-09-08 17:59:47,662 (25/4216637) state change: job is running
galaxy.jobs.runners.drmaa DEBUG 2015-09-08 17:59:49,012 (25/4216637) state change: job finished, but failed
galaxy.datatypes.metadata DEBUG 2015-09-08 17:59:49,745 Cleaning up external metadata files
galaxy.datatypes.metadata DEBUG 2015-09-08 17:59:49,802 Failed to cleanup MetadataTempFile temp files from /galaxy/data/files/000/25/metadata_out_HistoryDatasetAssociation_25_b_Y61R: No JSON object could be decoded
galaxy.jobs.runners DEBUG 2015-09-08 17:59:49,910 (25/4216637) Unable to cleanup /galaxy/data/files/000/25/galaxy_25.sh: [Errno 2] No such file or directory: '/galaxy/data/files/000/25/galaxy_25.sh'
galaxy.jobs.runners DEBUG 2015-09-08 17:59:49,984 (25/4216637) Unable to cleanup /galaxy/data/files/000/25/galaxy_25.o: [Errno 2] No such file or directory: '/galaxy/data/files/000/25/galaxy_25.o'
galaxy.jobs.runners DEBUG 2015-09-08 17:59:50,088 (25/4216637) Unable to cleanup /galaxy/data/files/000/25/galaxy_25.e: [Errno 2] No such file or directory: '/galaxy/data/files/000/25/galaxy_25.e'
galaxy.jobs.runners DEBUG 2015-09-08 17:59:50,207 (25/4216637) Unable to cleanup /galaxy/data/files/000/25/galaxy_25.ec: [Errno 2] No such file or directory: '/galaxy/data/files/000/25/galaxy_25.ec'
10.80.6.195 - - [08/Sep/2015:17:59:50 +0200] "GET /api/histories/ebfb8f50c6abde6d/contents HTTP/1.1" 200 - "http://srsu369:2208/" "Mozilla/5.0 (Windows NT 6.1; rv:39.0) Gecko/20100101 Firefox/39.0"
10.80.6.195 - - [08/Sep/2015:17:59:50 +0200] "GET /api/histories/ebfb8f50c6abde6d HTTP/1.1" 200 - "http://srsu369:2208/" "Mozilla/5.0 (Windows NT 6.1; rv:39.0) Gecko/20100101 Firefox/39.0"


 

Problem: when I look in /galaxy/data/files/000/, there is no <job_id> folder (25, for example), only the /galaxy/data/files/000/dataset_25.dat file.


--> The qsub command works (so the problem doesn't come from the cluster).
--> If I run Galaxy with job_conf.xml.sample_basic, so that the job runs on local resources, it also works (so the problem doesn't come from the tool).
--> I have the required permissions on the /galaxy/data folder.
--> My Galaxy instance, the tools and the data are stored on a shared file system that is available to all the cluster nodes.


I am running out of ideas.

Thanks in advance !

F.T

 

***Edit***

 

Hi,

I would like to add some information. It seems that I have two issues.

The first one is about folder permissions: I run Galaxy as galaxy_user (the account I used to install my Galaxy instance), which belongs to galaxy_user_group_1. But the directory where the outputs are written (file_path) belongs to the same user with the group galaxy_user_group_2.

When I run Galaxy, in most cases the outputs take the ownership of the parent folder (galaxy_user::galaxy_user_group_2). This gives me an empty file and a "job terminated but failed" error.
But sometimes the outputs take the ownership galaxy_user::galaxy_user_group_1 (the user and group of the installation account). Then I get "job finished normally", but the dataset appears in red in the Galaxy web page (that's the second problem, I think).


I changed the file_path variable to a folder which belongs to galaxy_user::galaxy_user_group_1. All the jobs then finished normally, but they appeared in red in the Galaxy web page with the "Job output not returned from cluster" error.

 

Tags: sge, galaxy, cluster

Answer written 3.2 years ago by FT (France):


Hi,

I'm coming back to this post with some news. I fixed one of my issues: it turned out that one of the nodes of my cluster didn't share the directory where I store the data. That's why jobs sometimes succeeded and sometimes failed: it depended on whether the node that executed the job had access to the data.

But I'm still stuck on the second issue: the state is set to ERROR and the output appears in red in the web page, even though the job finishes normally and I can view or download the output.
I get the following lines in the log:


Handled output (159.005 ms)
Verified access to datasets (0.237 ms)
Tool [myTool_id] created job [54] (385.955 ms)
[…]
(54) Working directory for job is: /galaxy/data/files/000/54
(54) Dispatching to sge runner
(54) Persisting job destination (destination id: sge_runner)
(54) Job [54] queued (303.560 ms)
(54) Job dispatched
(54) command is: bash /path/to/myTool.sh path/to/the/input /galaxy/data/files/000/dataset_54.dat; return_code=$?; python "/galaxy/data/files/000/54/set_metadata_inpmwB.py" "/galaxy/data/tmp/tmpNDr_Oc" "/galaxy/data/files/000/54/galaxy.json"  "/galaxy/data/files/000/54/metadata_in_HistoryDatasetAssociation_54_bOI6dF,/galaxy/data/files/000/54/metadata_kwds_HistoryDatasetAssociation_54_rPgxAm,/galaxy/data/files/000/54/metadata_out_HistoryDatasetAssociation_54_QAHIid,/galaxy/data/files/000/54/metadata_results_HistoryDatasetAssociation_54_LdEQRt,/galaxy/data/files/000/dataset_54.dat,/galaxy/data/files/000/54/metadata_override_HistoryDatasetAssociation_54_FHbvg4" 5242880; sh -c "exit $return_code"
(54) submitting file /galaxy/data/files/000/54/galaxy_54.sh
(54) native specification is: -q SRSTst.q
(54) queued as 4298436
(54) Persisting job destination (destination id: sge_runner)
(54/4298436) state change: job is queued and active
[…]
(54/4298436) state change: job is running
[…]
(54/4298436) state change: job finished normally
[…]
(54/4298436) Job output not returned from cluster: [Errno 2] No such file or directory: '/galaxy/data/files/000/54/galaxy_54.e' setting dataset state to ERROR
[…]
job 54 ended (finish() executed in (466.041 ms))
[…]

 

I set the cleanup_job variable (in galaxy.ini) to "onsuccess" and saw the problem: the folder 54 exists but doesn't contain a galaxy_54.e file. Instead there's a galaxy_54.ec file (for the exit code). That explains the "No such file or directory: '/galaxy/data/files/000/54/galaxy_54.e'" message.
But I don't know whether it comes from a misconfiguration of my Galaxy instance or of the cluster. Any suggestions?

Thanks,
F.T

Answer written 2.8 years ago by danielfortin86 (United States):

Hi F.T.,

This may be unrelated, but I also had an error when jobs finished on an SGE cluster. Adding this setting solved my problem, though it may not work for you:

<param id="embed_metadata_in_job">False</param>

E.g.:

<destination id="sge" runner="sge">
   <param id="embed_metadata_in_job">False</param>
</destination>

 

Answer written 2.7 years ago by FT (France):

Hi Daniel,

In our case, as I recall, it came from our SGE default settings, which merged the error and output streams into a single file, whereas Galaxy expects two files. I added the "-j" option to the job_conf.xml file (where you can set the same arguments as for a qsub command), and it worked ;)

F.T
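
For readers hitting the same "Job output not returned from cluster" error, here is a minimal sketch of what such a destination might look like. It assumes the option was passed through the DRMAA runner's nativeSpecification parameter, and that the value was "-j n" (do not merge stderr into stdout). The post only mentions "-j", so the exact value is an assumption. The queue name is the one that appears in the log above.

```xml
<destinations>
    <!-- nativeSpecification accepts the same arguments as a qsub command line -->
    <destination id="sge_runner" runner="sge">
        <!-- "-j n" (assumed value) asks SGE to keep stderr separate from stdout,
             so that both galaxy_<job_id>.o and galaxy_<job_id>.e are written -->
        <param id="nativeSpecification">-q SRSTst.q -j n</param>
    </destination>
</destinations>
```

If the cluster default is set elsewhere (for example in a site-wide sge_request file), overriding it per destination like this keeps the change local to Galaxy jobs.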
