Question: Keep Alive for Communication Between Pulsar and Galaxy
bornea27 (United States) wrote, 3.7 years ago:

I have a Linux server running Galaxy that needs to communicate with a Windows machine running Pulsar for a specific job. The jobs can take hours, and the network at the hospital where I work can be unreliable. I need a way to increase the keep-alive so that a network hiccup does not cause the job in Galaxy to come back as an error; once the Windows machine has received the data it needs, the job there will complete no matter what the network does.

Just for clarification, the wall time configured in job_conf is not the issue. I am looking for a way to either check in periodically and fail only after x failed check-ins, or keep running unless nothing has been heard from the server for more than x minutes. Any help is greatly appreciated.

jmchilton (United States) wrote, 3.7 years ago:

I think the Pulsar client will survive Pulsar being taken down for restarts and the like without killing the job prematurely, so the issue is probably with problems during the transfers at the beginning and end of jobs. Is this what you have observed?

Certainly there is a whole range of actions that Pulsar should be configurable to retry, but unfortunately it isn't. I have created a card here: https://github.com/galaxyproject/pulsar/issues/56.
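
For reference, the tolerance asked for in the question (fail only after x consecutive missed check-ins) might look roughly like the sketch below. This is purely illustrative and is not something Pulsar or Galaxy provides: `wait_with_tolerance`, `check_in`, and both parameters are hypothetical names, and `check_in` is assumed to return True once the job has finished and to raise `OSError` (or a subclass such as `ConnectionError`) when the server cannot be reached.

```python
import time
from typing import Callable


def wait_with_tolerance(
    check_in: Callable[[], bool],
    max_failed_check_ins: int = 5,
    poll_interval: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll a remote job, tolerating transient network failures.

    check_in is a hypothetical callable: it returns True when the job has
    finished, False while it is still running, and raises OSError when the
    remote server cannot be reached.

    Returns True when the job finishes, or False once max_failed_check_ins
    consecutive check-ins have failed.
    """
    consecutive_failures = 0
    while True:
        try:
            finished = check_in()
        except OSError:
            # A network hiccup: count it, but do not fail the job yet.
            consecutive_failures += 1
            if consecutive_failures >= max_failed_check_ins:
                return False
        else:
            # Any successful check-in resets the failure counter.
            consecutive_failures = 0
            if finished:
                return True
        sleep(poll_interval)
```

With `max_failed_check_ins=5` and `poll_interval=60.0`, the job would be marked as failed only after roughly five minutes of continuous silence, which is close to the second behaviour described in the question.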

When Pulsar initiates actions, it can be configured to retry them. So if, in your destination configuration in Galaxy's job_conf.xml, you set <param id="default_file_action">remote_transfer</param>, then instead of Galaxy trying to send files to Pulsar, Pulsar will try to pull the files from Galaxy. You can then set the following parameters in Pulsar's server.ini to control retrying these transfer actions:

preprocess_action_max_retries
preprocess_action_interval_start
preprocess_action_interval_step
preprocess_action_interval_stop
postprocess_action_max_retries
postprocess_action_interval_start
postprocess_action_interval_step
postprocess_action_interval_stop
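
To get a feel for how much network downtime a given set of values would ride out, here is a small sketch. It assumes the parameters describe a linear back-off: wait `interval_start` seconds before the first retry, add `interval_step` for each further retry, and never wait longer than `interval_stop`. That interpretation is an assumption based on the parameter names, not confirmed Pulsar behaviour, and `retry_delays` is a hypothetical helper rather than part of Pulsar.

```python
from typing import Iterator


def retry_delays(
    interval_start: float,
    interval_step: float,
    interval_stop: float,
    max_retries: int,
) -> Iterator[float]:
    """Yield the assumed wait (in seconds) before each retry.

    Assumed semantics (not verified against Pulsar): a linear back-off that
    starts at interval_start, grows by interval_step per retry, and is capped
    at interval_stop.
    """
    for attempt in range(max_retries):
        yield min(interval_start + attempt * interval_step, interval_stop)


# Example values chosen for illustration only.
delays = list(retry_delays(interval_start=10, interval_step=10,
                           interval_stop=30, max_retries=5))
total_wait = sum(delays)  # upper bound on the outage these retries would cover
```

Under that assumption, the example values give waits of 10, 20, 30, 30, and 30 seconds, so a transfer would survive an outage of up to about two minutes before giving up.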

This doesn't really help with problems sending the initial setup message to Pulsar: if that doesn't work, everything fails. Galaxy can be configured to place the setup message in a message queue, but this is completely untested with a Windows-based Pulsar server. Unfortunately, none of the remote transfer stuff has been tested with a Windows-based Pulsar server either.

Sorry I don't have better news. All of the recent efforts toward making Pulsar more resilient have been aimed at using the message queue and have been tested exclusively under *nix systems. If you do put in the effort to get these things working under Windows, I will be happy to assist with that (https://github.com/galaxyproject/pulsar/issues/57).

-John
