Question: slurmstepd error job exceeded memory limit (8034000 > 7864320) being killed
christopher.paran wrote, 4.1 years ago:

Hi, I'm posting this question because I didn't see a previous post about it (a slurmstepd error, seen on Thursday, Oct 30 at 6 PM).

I ran a job using the Galaxy browser interface following the Galaxy 101 tutorial. This particular workflow includes importing 1) all exons [~610,000 regions in BED format] and 2) all repeats [~5,600,000 regions in BED format] from the hg19 build on the UCSC Genome Browser website.

After importing these two datasets, the next step is a join command to merge them. This job was cancelled with a "slurmstepd: error: job ###### exceeded memory limit (8034000 > 7864320), being killed" message.
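
As a back-of-the-envelope check (assuming the figures in that message are kibibytes, which is typically how Slurm reports memory in this error), the job was only about 166 MiB over a 7.5 GiB cap:

```python
# Toy arithmetic only; assumes the numbers in the slurmstepd message are kibibytes (KiB).
used_kb, limit_kb = 8_034_000, 7_864_320

def kb_to_gib(kb: int) -> float:
    """Convert kibibytes to gibibytes."""
    return kb / 1024 ** 2

print(f"used : {kb_to_gib(used_kb):.2f} GiB")             # ~7.66 GiB
print(f"limit: {kb_to_gib(limit_kb):.2f} GiB")            # 7.50 GiB
print(f"over by ~{(used_kb - limit_kb) / 1024:.0f} MiB")  # ~166 MiB
```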

I'm sure I could find a workaround for this issue, but I am a beginner and it is conceptually easier for me to simply join an all-exons dataset with an all-repeats dataset without dividing them into smaller pieces. The memory usage seems to be only just over the limit, so perhaps earlier genome builds contained slightly less information and ran without triggering the error.

My question is: is the memory limit a temporary issue, i.e., will it be increased in the near future? Or...

Do I need to find a workaround?

Thanks, Chris

Tags: software error, galaxy
Nate Coraor wrote, 4.1 years ago:

Just to add a bit to what Jen said, this is caused by a recent change in the way jobs run at usegalaxy.org. In the past, memory usage on our cluster was uncontrolled. Jobs ran on nodes with 128 GB of memory but could be scheduled with up to 15 other jobs on the same node. If those other jobs were using only small amounts of memory, a single job could end up using far more than 8 GB.

However, this made things inconsistent: one time you ran a job, it may have worked fine; the next time you ran it, it may have run out of memory. It was all "luck of the draw." In order to address this and deal with some other issues, we started scheduling memory, so jobs are now limited to ~8 GB. This will cause some things to fail that probably would not have failed most of the time in the past. However, we can address this by allocating more memory for tools that need it; we just need to know when this happens. In the future, if you encounter out-of-memory errors, please use the bug icon to report them.
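
(As a side illustration only: this is not how Slurm enforces limits on usegalaxy.org, where slurmstepd accounts for job memory and kills offenders. It is a Linux-only toy sketch, using Python's resource module, of what hitting a hard memory cap looks like from inside a process.)

```python
# Linux-only toy sketch: cap this process's address space, then try to exceed it.
# This is NOT how Slurm enforces job limits; it only illustrates the effect of a hard cap.
import resource

CAP_BYTES = 1 * 1024 ** 3  # pretend the scheduler granted us 1 GiB

# Lower the soft and hard address-space limits for this process.
resource.setrlimit(resource.RLIMIT_AS, (CAP_BYTES, CAP_BYTES))

try:
    buf = bytearray(2 * 1024 ** 3)  # ask for 2 GiB, which exceeds the cap
except MemoryError:
    print("allocation refused: exceeded the memory limit (a cluster job would be killed)")
```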

We will also be enabling a feature that allows us to reschedule jobs with increased memory allocations if they fail with the default allocation, but this won't be until after the December release, most likely.

In this specific case, I'll check the Galaxy 101 steps to make sure it's still possible to run. It may be the case that one of the job inputs or tool parameters was not quite correct. I don't think this tool should have been using so much memory.

Jennifer Hillman Jackson wrote, 4.1 years ago:

Hi Chris,

The first thing to double check is the protocol. There are a few "join" type tools and sometimes these are mixed up (even by some of us who have done it many times in training!).
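
(For illustration only, and not the Galaxy tool's code: here is a minimal Python sketch of what an interval-overlap join does, pairing every exon with every repeat whose region overlaps it. This is what distinguishes "Operate on Genomic Intervals -> Join" from a plain column join.)

```python
# Toy sketch of an interval-overlap join on two BED files. It illustrates the idea
# behind "Operate on Genomic Intervals -> Join"; the real tool is far more efficient
# and has many more options. File names below are hypothetical.
from collections import defaultdict

def read_bed(path):
    """Yield (chrom, start, end, all_fields) for each BED record."""
    with open(path) as fh:
        for line in fh:
            fields = line.rstrip("\n").split("\t")
            yield fields[0], int(fields[1]), int(fields[2]), fields

def interval_join(exons_path, repeats_path):
    """Naive per-chromosome join: pair every exon with every repeat it overlaps."""
    repeats_by_chrom = defaultdict(list)
    for chrom, start, end, fields in read_bed(repeats_path):
        repeats_by_chrom[chrom].append((start, end, fields))

    for chrom, e_start, e_end, e_fields in read_bed(exons_path):
        for r_start, r_end, r_fields in repeats_by_chrom[chrom]:
            if e_start < r_end and r_start < e_end:  # half-open BED intervals overlap
                yield e_fields + r_fields

# Hypothetical usage:
# for row in interval_join("hg19_exons.bed", "hg19_repeats.bed"):
#     print("\t".join(row))
```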

Using the exact tools and parameters from the example is the best way to work through it. Once you have that working and wish to modify it, you will need to stay within the public Main instance's resource allocation if you want to do research there (at http://usegalaxy.org).

If all seems to be in line with the tutorial, have you tried to run the job again since the initial failure? Maybe try one more time now. It is possible that you ran into a cluster node that happened to be working on another, larger job. That's not common, but it has happened.

When/if you need more compute, or wish to start a large project, a cloud Galaxy can be a very powerful option. Amazon offers grants for researchers/students. Working on your own personal CloudMan Galaxy, especially one with scaled resources, keeps large projects humming along - and with no long-term investments/commitments. http://usegalaxy.org/cloud

After reviewing and re-running, please let us know if you continue to have issues - and share more details (exact tool names, non-default parameters, input sources, etc) so that we can help you to double check usage.

Thanks, Jen, Galaxy team

christopher.paran wrote, 4.1 years ago:

Hi, thanks very much for the responses; I understand the resource allocation now. I believe I did use the appropriate datasets (the Feb 2009 human genome build exons and repeats) and tool (Operate on Genomic Intervals - Join) as described in the tutorial, so I've gone ahead and used the bug report button in the workspace. I am just teaching myself NGS analysis, although I'll look into the Amazon options as well. I'll try again early next year on the Galaxy cluster as recommended, too. Thanks! Chris

Nate Coraor replied, 4.1 years ago:

If you rerun your job it should work (I successfully did so using the history you sent us with the bug report). The join tool has been allocated more memory.

It's possible that it will still run out of memory with larger inputs. We are keeping an eye on job completions and will attempt to address memory exhaustion as much as possible.

christopher.paran wrote, 4.1 years ago:

I re-ran it and it did work; I was able to complete the entire workflow in the 101 tutorial as well. Thanks very much to all for the help! Chris
