Question: slurmstepd error job exceeded memory limit (8034000 > 7864320) being killed
christopher.paran wrote, 4.1 years ago:

Hi, I'm posting this question because I didn't see a previous post about it (a slurmstepd error, seen on Thursday, Oct 30 at 6 PM).

I ran a job using the Galaxy browser interface following the Galaxy 101 tutorial. This particular workflow includes importing 1) all exons [~610,000 regions in BED format] and 2) all repeats [~5,600,000 regions in BED format] from the hg19 build on the UCSC Genome Browser website.

After importing these two datasets, the next step is a join command to merge them. This job was cancelled with a "slurmstepd: error: job ###### exceeded memory limit (8034000 > 7864320), being killed" message.
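
As a back-of-the-envelope check (assuming the figures in that message are kibibytes, which is typically how Slurm reports memory in this error), the job was only about 166 MiB over a 7.5 GiB cap:

```python
# Toy arithmetic only; assumes the numbers in the slurmstepd message are kibibytes (KiB).
used_kb, limit_kb = 8_034_000, 7_864_320

def kb_to_gib(kb: int) -> float:
    """Convert kibibytes to gibibytes."""
    return kb / 1024 ** 2

print(f"used : {kb_to_gib(used_kb):.2f} GiB")             # ~7.66 GiB
print(f"limit: {kb_to_gib(limit_kb):.2f} GiB")            # 7.50 GiB
print(f"over by ~{(used_kb - limit_kb) / 1024:.0f} MiB")  # ~166 MiB
```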

I'm sure I could find a workaround for this issue, but I am a beginner and it is conceptually easier for me to simply join an all-exons dataset with an all-repeats dataset without dividing them into smaller pieces. The memory usage seems to be only just over the limit, so perhaps earlier genome builds contained slightly less information and ran without triggering the error.

My question is: is the memory limit a temporary issue, i.e., will it be increased in the near future? Or...

Do I need to find a workaround?

Thanks, Chris

Tags: software error, galaxy
Nate Coraor wrote, 4.1 years ago:

Just to add a bit to what Jen said, this is caused by a recent change in the way jobs run at usegalaxy.org. In the past, memory usage on our cluster was uncontrolled. Jobs ran on nodes with 128 GB of memory but could be scheduled with up to 15 other jobs on the same node. If those other jobs were using only small amounts of memory, a single job could end up using far more than 8 GB.

However, this made things inconsistent: one time you ran a job, it may have worked fine; the next time you ran it, it may have run out of memory. It was all "luck of the draw." In order to address this and deal with some other issues, we started scheduling memory, so jobs are now limited to ~8 GB. This will cause some things to fail that probably would not have failed most of the time in the past. However, we can address this by allocating more memory for tools that need it; we just need to know when this happens. In the future, if you encounter out-of-memory errors, please use the bug icon to report them.
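
(As a side illustration only: this is not how Slurm enforces limits on usegalaxy.org, where slurmstepd accounts for job memory and kills offenders. It is a Linux-only toy sketch, using Python's resource module, of what hitting a hard memory cap looks like from inside a process.)

```python
# Linux-only toy sketch: cap this process's address space, then try to exceed it.
# This is NOT how Slurm enforces job limits; it only illustrates the effect of a hard cap.
import resource

CAP_BYTES = 1 * 1024 ** 3  # pretend the scheduler granted us 1 GiB

# Lower the soft and hard address-space limits for this process.
resource.setrlimit(resource.RLIMIT_AS, (CAP_BYTES, CAP_BYTES))

try:
    buf = bytearray(2 * 1024 ** 3)  # ask for 2 GiB, which exceeds the cap
except MemoryError:
    print("allocation refused: exceeded the memory limit (a cluster job would be killed)")
```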

We will also be enabling a feature that allows us to reschedule jobs with increased memory allocations if they fail with the default allocation, but this won't be until after the December release, most likely.

In this specific case, I'll check the Galaxy 101 steps to make sure it's still possible to run. It may be the case that one of the job inputs or tool parameters was not quite correct. I don't think this tool should have been using so much memory.

Jennifer Hillman Jackson wrote, 4.1 years ago:

Hi Chris,

The first thing to double check is the protocol. There are a few "join" type tools and sometimes these are mixed up (even by some of us who have done it many times in training!).
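
(For illustration only, and not the Galaxy tool's code: here is a minimal Python sketch of what an interval-overlap join does, pairing every exon with every repeat whose region overlaps it. This is what distinguishes "Operate on Genomic Intervals -> Join" from a plain column join.)

```python
# Toy sketch of an interval-overlap join on two BED files. It illustrates the idea
# behind "Operate on Genomic Intervals -> Join"; the real tool is far more efficient
# and has many more options. File names below are hypothetical.
from collections import defaultdict

def read_bed(path):
    """Yield (chrom, start, end, all_fields) for each BED record."""
    with open(path) as fh:
        for line in fh:
            fields = line.rstrip("\n").split("\t")
            yield fields[0], int(fields[1]), int(fields[2]), fields

def interval_join(exons_path, repeats_path):
    """Naive per-chromosome join: pair every exon with every repeat it overlaps."""
    repeats_by_chrom = defaultdict(list)
    for chrom, start, end, fields in read_bed(repeats_path):
        repeats_by_chrom[chrom].append((start, end, fields))

    for chrom, e_start, e_end, e_fields in read_bed(exons_path):
        for r_start, r_end, r_fields in repeats_by_chrom[chrom]:
            if e_start < r_end and r_start < e_end:  # half-open BED intervals overlap
                yield e_fields + r_fields

# Hypothetical usage:
# for row in interval_join("hg19_exons.bed", "hg19_repeats.bed"):
#     print("\t".join(row))
```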

Using the exact tools and parameters from the example is the best way to work through it. Once you have that working and wish to modify it, you will need to stay within the public Main instance's resource allocation if you want to do research there (at http://usegalaxy.org).

If all seems to be in line with the tutorial, have you tried to run the job again since the initial failure? Maybe try one more time now. It is possible that you ran into a cluster node that happened to be working on another, larger job. That's not common, but it has happened.

When/if you need more compute, or wish to start a large project, a cloud Galaxy can be a very powerful option. Amazon offers grants for researchers/students. Working on your own personal CloudMan Galaxy, especially one with scaled resources, keeps large projects humming along - and with no long-term investments/commitments. http://usegalaxy.org/cloud

After reviewing and re-running, please let us know if you continue to have issues - and share more details (exact tool names, non-default parameters, input sources, etc) so that we can help you to double check usage.

Thanks, Jen, Galaxy team

christopher.paran wrote, 4.1 years ago:

Hi, thanks very much for the responses; I understand the resource allocation now. I believe I did use the appropriate datasets (the Feb 2009 human genome build exons and repeats) and tool (Operate on Genomic Intervals - Join) as described in the tutorial, so I've gone ahead and used the bug report button in the workspace. I am just teaching myself NGS analysis, although I'll look into the Amazon options as well. I'll try again early next year on the Galaxy cluster as recommended, too. Thanks! Chris

Nate Coraor replied, 4.1 years ago:

If you rerun your job it should work (I successfully did so using the history you sent us with the bug report). The join tool has been allocated more memory.

It's possible that it will still run out of memory with larger inputs. We are keeping an eye on job completions and will attempt to address memory exhaustion as much as possible.

christopher.paran wrote, 4.1 years ago:

I re-ran it and it did work; I was able to complete the entire workflow in the 101 tutorial as well. Thanks very much to all for the help! Chris
