I think the Pulsar client will survive Pulsar being taken down for restarts and the like without killing the job prematurely, so the issue is probably with failures during the file transfers at the beginning and end of jobs. Is this what you have observed?
There is certainly a whole range of actions that Pulsar should be configurable to retry, but unfortunately it isn't. I have created an issue for this: https://github.com/galaxyproject/pulsar/issues/56.
Actions that Pulsar itself initiates can be configured to retry. If you set <param id="default_file_action">remote_transfer</param> in your destination configuration in Galaxy's job_conf.xml, then instead of Galaxy trying to send files to Pulsar, Pulsar will try to pull the files from Galaxy. You can then set the following parameters in Pulsar's server.ini to control retrying these transfer actions:
preprocess_action_max_retries
preprocess_action_interval_start
preprocess_action_interval_step
preprocess_action_interval_stop
postprocess_action_max_retries
postprocess_action_interval_start
postprocess_action_interval_step
postprocess_action_interval_stop
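To give a feel for how the four settings in each group might interact, here is a rough Python sketch. It is not Pulsar's actual code. It assumes a linear backoff: the first wait is interval_start seconds, each later wait grows by interval_step, and no wait exceeds interval_stop. The function names retry_intervals and run_with_retries are made up for illustration, and the exact semantics in Pulsar may differ.

```python
import time


def retry_intervals(max_retries, interval_start, interval_step, interval_stop):
    """Yield the wait in seconds before each retry.

    Assumed semantics (not confirmed against Pulsar): linear growth
    from interval_start by interval_step, capped at interval_stop.
    """
    wait = interval_start
    for _ in range(max_retries):
        yield min(wait, interval_stop)
        wait += interval_step


def run_with_retries(action, max_retries, interval_start, interval_step,
                     interval_stop, sleep=time.sleep):
    """Call action(); on failure, wait and retry up to max_retries times."""
    waits = list(retry_intervals(max_retries, interval_start,
                                 interval_step, interval_stop))
    for attempt in range(max_retries + 1):
        try:
            return action()
        except Exception:
            # Out of retries: let the last error propagate to the caller.
            if attempt == max_retries:
                raise
            sleep(waits[attempt])


if __name__ == "__main__":
    # Hypothetical values, chosen only to show the shape of the schedule.
    print(list(retry_intervals(5, 2, 2, 6)))
```

Under these assumptions, a larger max_retries combined with a modest interval_stop lets a transfer ride out a longer outage without the gap between attempts growing unbounded.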
This doesn't really help with problems sending the initial setup message to Pulsar. If that fails, everything fails. Galaxy can be configured to place the setup message in a message queue, but this is completely untested with a Windows-based Pulsar server. Unfortunately, none of the remote transfer functionality has been tested with a Windows-based Pulsar server either.
Sorry I don't have better news. All of the recent efforts toward making Pulsar more resilient have been aimed at using the message queue and have been tested exclusively under *nix systems. If you do put in the effort to get these things working under Windows, I will be happy to assist with that (https://github.com/galaxyproject/pulsar/issues/57).
-John