Question: CloudMan not launching worker nodes properly
13 months ago by
sarah.hein.0 wrote:

Hello there! I have been trying to launch a Galaxy CloudMan cluster for the past several days, but I just cannot get it to work correctly. I have primarily been using the new interface, and it launches the instances with no problem. The problem is that it won't scale up the worker nodes.

Every time CloudMan launches a new node, it creates the instance (confirmed from the EC2 interface), but then doesn't seem to be able to communicate with it. It then tries to reboot the node several times over ~20 minutes, before finally giving up and killing the node. Worse, by then the main node has already stopped work and tried to offload some of the compute, and it doesn't pick that work back up properly. This becomes a permanent problem for the whole cluster.

I tried launching multiple times with various configurations, but just cannot figure out what I am doing wrong. I also tried launching from the old interface, but that doesn't seem to be communicating with my AWS account at all. I tried different keypairs with full admin access, creating a new subnet, running an older version of Cloudman, etc. Nothing works.

I finally just spun up 1 very large instance to try to churn through the data, but it is very slow going!! I desperately need to be able to scale my nodes to process this data!

software error galaxy • 430 views
13 months ago by
Enis Afgan (United States) wrote:

Thanks for reporting this. There was a bug in the communication setup between the master and worker nodes. It has been corrected and the CloudLaunch server updated, so things should start working again for any newly launched cluster. To fix an already running cluster, log into the AWS console and, under security groups, find the 'cloudlaunch-cm' group and edit its inbound rules to add a rule enabling open communication among instances running in the same security group, as per the attached screenshot (just note that your security group ID will be different).

[Screenshot: inbound rules of the 'cloudlaunch-cm' security group]
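
For anyone who would rather script the fix than click through the console, here is a minimal sketch using boto3. It is not taken from the screenshot, so the exact rule is an assumption: it adds an inbound rule allowing all protocols and ports from the security group to itself, which is the broadest reading of "open communication among instances running in the same security group". The helper names (`self_referencing_rule`, `has_self_rule`, `open_intra_group_traffic`) are made up for illustration, and the region you pass in must be the one your cluster runs in.

```python
# Sketch only: adds a self-referencing inbound rule to the CloudMan security group.
# Assumes boto3 is installed and AWS credentials are already configured.

GROUP_NAME = "cloudlaunch-cm"  # group name given in the answer above


def self_referencing_rule(group_id):
    """Build an inbound permission that lets members of group_id reach each other.

    IpProtocol "-1" means all protocols and all ports (an assumption here;
    narrow it if you know which ports the master and workers actually need).
    """
    return {
        "IpProtocol": "-1",
        "UserIdGroupPairs": [{"GroupId": group_id}],
    }


def has_self_rule(group):
    """Return True if a describe_security_groups entry already allows
    all traffic from the group to itself."""
    for perm in group.get("IpPermissions", []):
        if perm.get("IpProtocol") != "-1":
            continue
        pairs = perm.get("UserIdGroupPairs", [])
        if any(pair.get("GroupId") == group["GroupId"] for pair in pairs):
            return True
    return False


def open_intra_group_traffic(region_name):
    """Find the CloudMan security group(s) in a region and add the rule if missing."""
    import boto3  # imported here so the helpers above work without boto3

    ec2 = boto3.client("ec2", region_name=region_name)
    response = ec2.describe_security_groups(
        Filters=[{"Name": "group-name", "Values": [GROUP_NAME]}]
    )
    for group in response["SecurityGroups"]:
        if has_self_rule(group):
            print("Rule already present on", group["GroupId"])
            continue
        ec2.authorize_security_group_ingress(
            GroupId=group["GroupId"],
            IpPermissions=[self_referencing_rule(group["GroupId"])],
        )
        print("Added intra-group rule to", group["GroupId"])
```

Calling `open_intra_group_traffic("us-east-1")` (the region is just an example) would apply the rule to every security group named 'cloudlaunch-cm' in that region. Worker nodes that CloudMan has already given up on would most likely still need to be added again afterwards.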

13 months ago by
sarah.hein.0 wrote:

Oh awesome!!! I thought I was going absolutely bonkers, and that I was just doing something wrong. This is incredibly helpful.

I've hit a point in processing where I think I will just download my data and relaunch in the morning with a more optimized configuration. I'll report back to confirm that everything is working to spec.

Thank you!

13 months ago by
sarah.hein.0 wrote:

One other note. I noticed that the most recent version of Cloudman/Galaxy being launched is version 17.X, but that it's pulling from the Cloudman-Test bucket by default. I tried to change it to the standard Cloudman release, as there were a few bugs in Galaxy itself (some of the packages not running correctly), but that resulted in a completely failed launch. Any ideas on whether a few more bugs could have been introduced in the changeover that are causing this?


There is an incompatibility with the older CloudMan releases, so the latest one was pulling the code out of the test bucket, as you noticed. With Galaxy 17.09 released now, I'll need to update the CloudMan release as well, so I can also fix the issues you mention. Could you please add the issues you've come across to this GitHub issue and I'll try to address them with the next release: https://github.com/galaxyproject/cloudman/issues/73

written 12 months ago by Enis Afgan