Keep Alive for Communication Between Pulsar and Galaxy


3.7 years ago by

United States

bornea27 • 20 wrote:

I have a Linux server running Galaxy that needs to communicate with a Windows machine running Pulsar for one specific job. The job can run for hours, and the network at the hospital where I work can be unreliable. I need a way to increase the keep-alive, or something equivalent, so that a network hiccup does not cause the job to come back as an error in Galaxy. Once the Windows machine has received the data it needs, the job there will complete no matter what the network does.

Just for clarification, the wall time configured in job_conf is not the issue. I am looking for a way to either check in periodically and fail only after x failed check-ins, or just keep running unless it has not heard from the server for more than x minutes. Any help is greatly appreciated.


modified 3.7 years ago by jmchilton ♦ 1.1k • written 3.7 years ago by bornea27 • 20

3.7 years ago by

jmchilton ♦ 1.1k

United States

jmchilton ♦ 1.1k wrote:

I think the Pulsar client will survive Pulsar being taken down for restarts and the like without killing the job prematurely, so the issue is probably with the transfers at the beginning and end of jobs. Is this what you have observed?

There is certainly a whole range of actions that Pulsar should be configurable to retry, but unfortunately it isn't. I have created a card for this here: https://github.com/galaxyproject/pulsar/issues/56.

When Pulsar itself initiates actions, it can be configured to retry them. If you set <param id="default_file_action">remote_transfer</param> in your destination configuration in Galaxy's job_conf.xml, then instead of Galaxy trying to send files to Pulsar, Pulsar will try to pull the files from Galaxy. You can then set the following parameters in Pulsar's server.ini to control retrying these transfer actions:

preprocess_action_max_retries
preprocess_action_interval_start
preprocess_action_interval_step
preprocess_action_interval_stop
postprocess_action_max_retries
postprocess_action_interval_start
postprocess_action_interval_step
postprocess_action_interval_stop
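
To give a rough idea of how settings like these usually interact, here is a minimal Python sketch of a retry loop with a growing wait between attempts. It is not Pulsar's actual code, and the semantics are my assumption based on the parameter names: wait interval_start seconds after the first failure, add interval_step after each further failure, never wait longer than interval_stop, and give up after max_retries failed retries. The function name retry_with_backoff and the flaky_transfer stand-in are hypothetical.

```python
import time


def retry_with_backoff(action, max_retries, interval_start, interval_step,
                       interval_stop, sleep=time.sleep):
    """Call action() until it succeeds, waiting longer after each failure.

    Assumed semantics (not taken from Pulsar's source): the wait starts at
    interval_start seconds, grows by interval_step per failure, and is
    capped at interval_stop. After max_retries retries the last error is
    re-raised.
    """
    interval = interval_start
    failures = 0
    while True:
        try:
            return action()
        except Exception:
            failures += 1
            if failures > max_retries:
                raise  # out of retries, let the caller see the error
            sleep(min(interval, interval_stop))
            interval += interval_step


# Demonstration with a stand-in for a file transfer that fails three times.
attempts = {"count": 0}
waits = []  # record the waits instead of actually sleeping


def flaky_transfer():
    attempts["count"] += 1
    if attempts["count"] <= 3:
        raise IOError("network hiccup")
    return "transferred"


result = retry_with_backoff(flaky_transfer, max_retries=5, interval_start=2,
                            interval_step=2, interval_stop=5,
                            sleep=waits.append)
print(result, waits)
```

With a generous retry count and a cap of a minute or more, a loop like this would ride out a short network outage instead of failing the transfer on the first error.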

This doesn't really help with problems sending the initial setup message to Pulsar; if that fails, everything fails. Galaxy can be configured to place the setup message in a message queue, but this is completely untested with a Windows-based Pulsar server. Unfortunately, none of the remote transfer functionality has been tested with a Windows-based Pulsar server either.

Sorry I don't have better news. All of the recent efforts toward making Pulsar more resilient have been aimed at using the message queue and have been tested exclusively on *nix systems. If you do put in the effort to get these things working under Windows, I will be happy to assist with that (https://github.com/galaxyproject/pulsar/issues/57).

-John

