2.9 years ago by
I think the Pulsar client will survive Pulsar being taken down for restarts and the like without killing the job prematurely, so the issue is probably with failures during the file transfers at the beginning and end of jobs. Is this what you have observed?
Certainly there is a whole range of actions which Pulsar should be configurable to retry, but unfortunately it isn't. I have created a card for this here: https://github.com/galaxyproject/pulsar/issues/56.
When Pulsar itself initiates an action, it can be configured to retry it. So if you set <param id="default_file_action">remote_transfer</param> in your destination configuration in Galaxy's job_conf.xml, then instead of Galaxy trying to send files to Pulsar, Pulsar will try to pull the files from Galaxy. Pulsar's server.ini then has parameters that control how these transfer actions are retried.
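For reference, the job_conf.xml change described above might look roughly like the sketch below. It covers only the Galaxy side and does not show the retry parameters in Pulsar's server.ini. The plugin and destination ids and the URL are placeholders I made up for illustration, so substitute your own.

```xml
<job_conf>
    <plugins>
        <!-- "pulsar_rest" is a placeholder id for the REST-based Pulsar runner -->
        <plugin id="pulsar_rest" type="runner" load="galaxy.jobs.runners.pulsar:PulsarRESTJobRunner"/>
    </plugins>
    <destinations default="win_pulsar">
        <!-- "win_pulsar" and the URL are hypothetical; point this at your Pulsar server -->
        <destination id="win_pulsar" runner="pulsar_rest">
            <param id="url">https://pulsar.example.org:8913/</param>
            <!-- Ask Pulsar to pull inputs from Galaxy instead of Galaxy pushing them -->
            <param id="default_file_action">remote_transfer</param>
        </destination>
    </destinations>
</job_conf>
```

With this in place the transfers are initiated by Pulsar, which is what makes them eligible for Pulsar's retry handling.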
This doesn't really help you with problems sending the initial setup message to Pulsar; if that fails, everything fails. Galaxy can be configured to place the setup message in a message queue, but this is completely untested with a Windows-based Pulsar server. Unfortunately, none of the remote transfer functionality has been tested with a Windows-based Pulsar server either.
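If you want to experiment with the message queue route, a Galaxy-side configuration could look something like the sketch below. Treat it as an untested starting point rather than a known-good setup, especially on Windows. The ids, the AMQP URL, the Galaxy URL, and the staging path are all placeholders, and I am assuming the Pulsar server is pointed at the same broker through its own message queue setting.

```xml
<job_conf>
    <plugins>
        <!-- Message-queue-based Pulsar runner; "pulsar_mq" is a placeholder id -->
        <plugin id="pulsar_mq" type="runner" load="galaxy.jobs.runners.pulsar:PulsarMQJobRunner">
            <!-- Hypothetical broker URL; Pulsar must be configured to use the same broker -->
            <param id="amqp_url">amqp://guest:guest@localhost:5672//</param>
            <!-- URL Pulsar uses to reach Galaxy when pulling and pushing files (placeholder) -->
            <param id="galaxy_url">https://galaxy.example.org</param>
        </plugin>
    </plugins>
    <destinations default="win_pulsar_mq">
        <destination id="win_pulsar_mq" runner="pulsar_mq">
            <!-- With a message queue, Pulsar has to fetch the files itself -->
            <param id="default_file_action">remote_transfer</param>
            <!-- Placeholder: the staging directory as seen on the Pulsar server -->
            <param id="jobs_directory">C:\pulsar\files\staging</param>
        </destination>
    </destinations>
</job_conf>
```

The point of this arrangement is that the setup message waits in the queue while Pulsar is unreachable, instead of failing the job outright.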
Sorry I don't have better news. All of the recent efforts toward making Pulsar more resilient have been aimed at using the message queue and have been tested exclusively on *nix systems. If you do put in the effort to get these things working under Windows, I will be happy to assist with that (https://github.com/galaxyproject/pulsar/issues/57).