Question: Galaxy instance on Jetstream - cluster error for downloaded files
nfillmor wrote, 15 months ago:

I am using a Galaxy instance on Jetstream but have run into a problem getting data files into Galaxy. After I download a file, its name (in the history panel on the right) turns red and I get the error message: "Unable to run this job due to a cluster error, please retry it later". These downloaded files are also not recognized by the tools.

Does anyone know how this problem can be fixed? Thank you.

Tags: admin, jetstream, galaxy
modified 15 months ago by Enis Afgan • written 15 months ago by nfillmor

There is probably a configuration problem. Are you following these instructions? https://galaxyproject.org/cloud/jetstream/

Do other jobs run, or is this the first job/type that fails? For example, have you installed native genomes with a Data Manager, and did those jobs complete successfully?

written 15 months ago by Jennifer Hillman Jackson

I am having this same problem. My Galaxy instance on Jetstream seems to be configured correctly since I was able to add myself as an admin and add tools from the Tool Shed. However, none of the avenues for getting files into my History are working: upload from computer, FTP, or directly pasting fasta data via the Fetch/Paste option. After a short while, the job turns red and gives the error: "Unable to run this job due to a cluster error, please retry it later". I have also tried pulling data straight from UCSC Main, but same error. I have reset the hostname and restarted the interface as described here: https://galaxyproject.org/cloud/jetstream/troubleshooting/, but still no luck. Checking supervisorctl, I have the following:

    cholley@js-168-202:~$ sudo supervisorctl
    cron             STOPPED   Not started
    galaxy:web0      RUNNING   pid 7134, uptime 0:20:15
    munge            RUNNING   pid 1952, uptime 1:13:49
    nginx            RUNNING   pid 1955, uptime 1:13:49
    postgresql       RUNNING   pid 1950, uptime 1:13:49
    pre_postgresql   EXITED    Aug 30 12:24 PM
    proftpd          RUNNING   pid 7612, uptime 0:03:49
    slurmctld        FATAL     Exited too quickly (process log may have details)
    slurmd           RUNNING   pid 1956, uptime 1:13:49

    supervisor> start slurmctld
    slurmctld: ERROR (abnormal termination)

Should slurmctld be running? It seems like it would be necessary to schedule jobs, but it will not start for some reason. Any ideas? I have used CloudMan via AWS quite successfully, but would like to be able to run jobs with my XSEDE allocation on Jetstream.

Thanks! Chris

written 15 months ago by cholley

Would one of you post the Galaxy admin log from Cloudman for troubleshooting? A link to a gist or similar would be great.

written 15 months ago by Jennifer Hillman Jackson

Sure - here is the gist link for galaxy_web0.log from my 17.01.01 instance.

Let me know if you want any of the other logs.

written 15 months ago by cholley
Enis Afgan wrote, 15 months ago:

It seems Atmosphere messes with the image and the hostname gets out of sync with Slurm. I've found two solutions to this:

  1. Launch a new instance using our launcher, available at https://beta.launch.usegalaxy.org/. When launched there, my instances have run jobs fine for as long as I've kept one alive. When launched via Atmosphere, instances quit running jobs after several minutes, reporting the cluster error.
  2. Fix your existing instance as follows: edit /etc/hostname so it matches the output of the hostname -s command. Then edit /etc/supervisor/conf.d/galaxy.conf and add the following line to the bottom of the slurmctld group: environment = SLURM_CONTROL_MACHINE=js-168-195, where js-168-195 is the output of hostname -s on your instance. Finally, run: sudo supervisorctl update && sudo supervisorctl restart slurmctld. (A shell sketch of these steps is shown below.)

After this, jobs should start running again.
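
For reference, here is a minimal shell sketch of solution 2. The hostname js-168-195 is only an example value (use your own hostname -s output), and nano is just one editor choice:

    # 1. Find the short hostname Slurm should use (example output: js-168-195)
    hostname -s

    # 2. Make /etc/hostname match that value
    sudo nano /etc/hostname

    # 3. In the slurmctld group of the supervisor config, add the line
    #    environment = SLURM_CONTROL_MACHINE=js-168-195
    sudo nano /etc/supervisor/conf.d/galaxy.conf

    # 4. Reload supervisor's configuration and restart the Slurm controller
    sudo supervisorctl update && sudo supervisorctl restart slurmctld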

modified 15 months ago • written 15 months ago by Enis Afgan

OK - thanks. The beta.launch.usegalaxy.org site's link to the Jetstream service (https://beta.launch.usegalaxy.org/marketplace/appliance/galaxy-stanalone-vm) asks for credentials through OpenStack, which I don't have. Is there any way to use the XSEDE allocation that I have been granted? Also, since this is an issue with Atmosphere, should we contact them to get it resolved? Solution #2 is OK, but it needs to be done each time an instance is resumed if the IP address has changed.

written 15 months ago by cholley

Also: option #2 doesn't work for me because I apparently don't have permission to edit those files ("Permission denied"). Any other ideas?

written 15 months ago by cholley

You need to edit them with sudo. (By the way, I'm working on a docs page describing how to get the credentials for using CloudLaunch.)

written 15 months ago by Enis Afgan

Of course! OK - I edited the hostname and galaxy.conf files and restarted the services, per your instructions. However, even after that, I still get the same errors in my Galaxy instance: Unable to run this job due to a cluster error, please retry it later. I re-verified that the files were edited correctly. Of note, this is what I got on restarting the services:

    cholley@js-157-57:~$ sudo supervisorctl reread && sudo supervisorctl restart slurmctld
    slurmctld: changed
    slurmctld: stopped
    slurmctld: ERROR (abnormal termination)

Here is what sudo supervisorctl gives after all that:

    cholley@js-157-57:~$ sudo supervisorctl
    galaxy:web0      RUNNING   pid 5034, uptime 1 day, 19:38:30
    munge            FATAL     Exited too quickly (process log may have details)
    nginx            RUNNING   pid 1664, uptime 1 day, 19:46:15
    postgresql       RUNNING   pid 1661, uptime 1 day, 19:46:15
    pre_postgresql   EXITED    Aug 30 04:14 PM
    proftpd          RUNNING   pid 1668, uptime 1 day, 19:46:15
    slurmctld        FATAL     Exited too quickly (process log may have details)
    slurmd           RUNNING   pid 1665, uptime 1 day, 19:46:15

So slurmctld is still not working. After that, I restarted Galaxy:

    supervisor> restart galaxy:web0
    galaxy:web0: stopped
    galaxy:web0: started
    supervisor>

and all jobs still fail....

written 15 months ago by cholley

I updated my answer above so the instructions should work now (the supervisorctl command should have been update instead of reread).

written 15 months ago by Enis Afgan

Super - this worked to get my Jetstream Galaxy 17.01.01 instance going! Jobs are running fine. Thanks for your help. I will also see if I can get an instance running via your new CloudLaunch instructions.

written 15 months ago by cholley

Rebooting the instance gets slurmctld going again, but munge is still dead and jobs still fail:

    cholley@js-157-57:~$ sudo supervisorctl
    galaxy:web0      RUNNING   pid 1653, uptime 0:01:57
    munge            FATAL     Exited too quickly (process log may have details)
    nginx            RUNNING   pid 1646, uptime 0:01:57
    postgresql       RUNNING   pid 1643, uptime 0:01:57
    pre_postgresql   EXITED    Sep 01 12:10 PM
    proftpd          RUNNING   pid 1651, uptime 0:01:57
    slurmctld        RUNNING   pid 1645, uptime 0:01:57
    slurmd           RUNNING   pid 1647, uptime 0:01:57
    supervisor>

modified 15 months ago • written 15 months ago by cholley

For Slurm, there may be more info in the Slurm log file: /var/log/supervisor/slurmd-stdout---supervisor-Iqdv0D.log (your filename won't look exactly like this). BTW, there's no real benefit in restarting Galaxy during this process.

Re. Munge - see what's showing up in /var/log/supervisor/munge-stdout---supervisor-rfQeP_.log (or similar).
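
A quick way to peek at both logs (a sketch: the random suffix in those filenames differs on every instance, so a glob is used to match it; adjust the paths if yours live elsewhere):

    # Tail the supervisor-managed Slurm and munge logs; the suffix after
    # "supervisor-" is random, so let a root shell expand the glob.
    sudo sh -c 'tail -n 50 /var/log/supervisor/slurmd-stdout---supervisor-*.log'
    sudo sh -c 'tail -n 50 /var/log/supervisor/munge-stdout---supervisor-*.log'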

Finally, I added docs on how to retrieve and load your credentials into CloudLaunch: https://galaxyproject.org/cloud/jetstream/allocation/#api-access

written 15 months ago by Enis Afgan