Question: Random tool failures
brian.hermann40 wrote (2.8 years ago):

Hi,

I've encountered a series of tool failures over the past day or so, most recently with TopHat and Cuffnorm runs on a dataset collection. The failures were reported as "failure preparing job," "failure preparing job script," and "Cluster could not complete job," and I noted the following in some of the errors: IOError: [Errno 28] No space left on device. My disk space quota is not used up, and I have run these analyses before. Any advice on how to fix this?

Thanks!

Brian
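
For context on the error itself: [Errno 28] is ENOSPC, which the operating system raises when the filesystem being written to is full. It is separate from Galaxy's per-user quota accounting, which is why a job can fail this way while your own quota shows plenty of room. A minimal sketch (hypothetical path, not Galaxy code) of catching the error and checking the filesystem that actually filled up:

import errno
import shutil

def report_free_space(path):
    # shutil.disk_usage inspects the filesystem holding `path`; this is
    # the number ENOSPC depends on, not any per-user quota.
    usage = shutil.disk_usage(path)
    print("free: %.1f GiB of %.1f GiB" % (usage.free / 2**30, usage.total / 2**30))

try:
    with open("/tmp/example.txt", "w") as f:  # hypothetical path
        f.write("x" * 1024)
except OSError as e:  # IOError is an alias of OSError in Python 3
    if e.errno == errno.ENOSPC:  # "No space left on device"
        report_free_space("/tmp")
    raise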

Jennifer Hillman Jackson wrote (2.8 years ago):

Hello,

These are all cluster-related errors. If the jobs were executed on the Test server, then the errors could be related to issues we had on that server recently (these can happen for a variety of reasons and are somewhat expected - that server is where we test code).

Please try running the jobs again today on the Main server, http://usegalaxy.org, and let us know if there are more problems. If the issue continues, please submit a bug report from a failed job started today. Be sure to leave the error datasets, and all of their inputs, undeleted when sending in a bug report.

Thanks, Jen, Galaxy team

brian.hermann40 replied:

Hi Jennifer,

I reran some Cuffnorm jobs (on usegalaxy.org) and had the same result (see below). I noticed that some of the jobs from the previous batch did complete correctly even after this post, but none (so far) from the re-run have. Thanks for your help.

Best,

Brian

Traceback (most recent call last):
  File "/galaxy-repl/instances/main/server/lib/galaxy/jobs/runners/__init__.py", line 170, in prepare_job
    include_work_dir_outputs=include_work_dir_outputs,
  File "/galaxy-repl/instances/main/server/lib/galaxy/jobs/runners/__init__.py", line 200, in build_command_line
    container=container
  File "/galaxy-repl/instances/main/server/lib/galaxy/jobs/command_factory.py", line 59, in build_command
    externalized_commands = __externalize_commands(job_wrapper, shell, commands_builder, remote_command_params)
  File "/galaxy-repl/instances/main/server/lib/galaxy/jobs/command_factory.py", line 91, in __externalize_commands
    write_script(local_container_script, script_contents, config)
  File "/galaxy-repl/instances/main/server/lib/galaxy/jobs/runners/util/job_script/__init__.py", line 101, in write_script
    f.write(contents)
IOError: [Errno 28] No space left on device
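
The failing frame is Galaxy externalizing the tool's command line into a wrapper script in the job's working directory before the job is handed to the cluster, so the disk that filled up is on the server side rather than a compute node. A minimal sketch of that write-script pattern (hypothetical names and messages, not Galaxy's actual implementation):

import errno
import os
import stat

def write_job_script(path, contents):
    # Write the externalized command line to a script file. If the server
    # filesystem holding `path` is full, the write raises OSError/IOError
    # with errno 28 (ENOSPC) before the job ever reaches the cluster --
    # which is how a "failure preparing job script" arises.
    try:
        with open(path, "w") as f:
            f.write(contents)
    except OSError as e:
        if e.errno == errno.ENOSPC:
            raise RuntimeError("failure preparing job script: server disk full") from e
        raise
    # Mark the script executable so the cluster runner can invoke it.
    os.chmod(path, os.stat(path).st_mode | stat.S_IXUSR)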

brian.hermann40 wrote (2.8 years ago):

Hi Jen,

Thanks for your reply. I thought I was running these on the main usegalaxy.org server. I noticed that queued jobs started running again, so perhaps the clusters were just overloaded? I submitted bug reports referencing this post and will leave the files in place. I will re-run today and let you know whether the jobs terminate or complete. Thanks!

-Brian

Jennifer Hillman Jackson replied:

If the jobs were just "grey - waiting to run" to start with, then the cluster is busy and the jobs are in the queue. Leave these to run.

If you had trouble launching jobs and the error was reported in the history panel or as a pop-up, then this was almost certainly a server load issue.

If the jobs error out (as some of your original errors indicate), then the cause could be the server, the cluster, or both. This is where the known issues at the Test server come in.

If jobs run (yellow), then fall back to in progress (grey), it could be that they exceeded compute resources on the default cluster and were automatically re-run on the longer-running cluster (Stampede), or that our admin saw them and is re-running them in batch (which could include yours, since they were reported, or all jobs that failed in a specific time window). The first case will finish quicker than the latter, but do leave the jobs alone while they are queued or running (grey or yellow).

Details about how jobs *usually* execute: https://wiki.galaxyproject.org/Support#Dataset_status_and_how_jobs_execute
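
If you would rather monitor job states programmatically than watch the history panel, something like the following sketch works with the BioBlend client (assuming bioblend is installed and you substitute your own API key; the state names follow the Galaxy API, where grey datasets report 'new' or 'queued' and yellow ones 'running'):

import time
from collections import Counter

from bioblend.galaxy import GalaxyInstance  # pip install bioblend

# Substitute your own API key (User -> API Keys on usegalaxy.org).
gi = GalaxyInstance(url="https://usegalaxy.org", key="YOUR_API_KEY")

def poll_history(history_id, interval=60):
    # Print a tally of dataset states until nothing is waiting or running.
    while True:
        contents = gi.histories.show_history(history_id, contents=True)
        states = Counter(d.get("state", "unknown") for d in contents)
        print(dict(states))
        if not (states["new"] or states["queued"] or states["running"]):
            break
        time.sleep(interval)

# poll_history("HISTORY_ID")  # hypothetical history id from the API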

We will also examine your bug report just to make sure nothing else is going on. Thanks for sending that in! Jen

brian.hermann40 replied:

Hi Jennifer,

As you suggested, I reran a batch of jobs that had previously failed and am still getting errors. See below for an example from TopHat. One of the TopHat jobs completed, but most failed. I guess I will sit tight until I hear back from you about whether there are additional issues. Thanks!

-Brian

Fatal error: Tool execution failed

[2016-02-16 15:45:54] Beginning TopHat run (v2.0.14)
-----------------------------------------------
[2016-02-16 15:45:54] Checking for Bowtie
		  Bowtie version:	 2.2.5.0
[2016-02-16 15:45:54] Checking for Bowtie index files (genome)..
[2016-02-16 15:45:55] Checking for reference FASTA file
[2016-02-16 15:45:55] Generating SAM header for /galaxy/data/mm10/bowtie2_index/mm10
[2016-02-16 15:48:13] Preparing reads
	 left reads: min. length=87, max. length=100, 2036107 kept reads (320 discarded)
	right reads: min. length=88, max. length=100, 2032211 kept reads (4216 discarded)
[2016-02-16 15:49:27] Mapping left_kept_reads to genome mm10 with Bowtie2 
[2016-02-16 15:54:14] Mapping left_kept_reads_seg1 to genome mm10 with Bowtie2 (1/4)
[2016-02-16 15:54:43] Mapping left_kept_reads_seg2 to genome mm10 with Bowtie2 (2/4)
[2016-02-16 15:55:09] Mapping left_kept_reads_seg3 to genome mm10 with Bowtie2 (3/4)
[2016-02-16 15:55:34] Mapping left_kept_reads_seg4 to genome mm10 with Bowtie2 (4/4)
[2016-02-16 15:55:58] Mapping right_kept_reads to genome mm10 with Bowtie2 
[bam_header_read] EOF marker is absent. The input is probably truncated.
[main_samview] truncated file.
Traceback (most recent call last):
  File "/galaxy/main/deps/tophat/2.0.14/iuc/package_tophat_2_0_14/536f7bb5616d/bin/tophat", line 4095, in <module>
    sys.exit(main())
  File "/galaxy/main/deps/tophat/2.0.14/iuc/package_tophat_2_0_14/536f7bb5616d/bin/tophat", line 4061, in main
    user_supplied_deletions)
  File "/galaxy/main/deps/tophat/2.0.14/iuc/package_tophat_2_0_14/536f7bb5616d/bin/tophat", line 3552, in spliced_alignment
    segment_len)
  File "/galaxy/main/deps/tophat/2.0.14/iuc/package_tophat_2_0_14/536f7bb5616d/bin/tophat", line 2998, in split_reads
    zf.close()
  File "/galaxy/main/deps/tophat/2.0.14/iuc/package_tophat_2_0_14/536f7bb5616d/bin/tophat", line 1825, in close
    if self.ftarget: self.ftarget.close()
IOError: [Errno 28] No space left on device
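
The "[bam_header_read] EOF marker is absent" line above is samtools noticing that an intermediate BAM file is missing its trailing BGZF EOF block, the usual symptom of a write cut short by a full disk. A small sketch of checking for that marker directly (the 28-byte EOF block is fixed by the BGZF specification):

import os

# Fixed 28-byte BGZF EOF block that terminates every intact BAM file.
BGZF_EOF = bytes([
    0x1F, 0x8B, 0x08, 0x04, 0x00, 0x00, 0x00, 0x00, 0x00, 0xFF,
    0x06, 0x00, 0x42, 0x43, 0x02, 0x00, 0x1B, 0x00, 0x03, 0x00,
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
])

def bam_is_truncated(path):
    # True when `path` lacks the BGZF EOF marker, i.e. the file was
    # probably truncated mid-write (for example, by a full disk).
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        if size < len(BGZF_EOF):
            return True
        f.seek(size - len(BGZF_EOF))
        return f.read(len(BGZF_EOF)) != BGZF_EOF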

Jennifer Hillman Jackson replied:

Thanks for sending this back. Our admin is looking into the issue again. Jen

brian.hermann40 replied:

I tried rerunning some of the jobs overnight and they seemed to be doing fine (no failures) until ~8am ET this morning, although they were running slowly.
