Feature request: detect when spot instances are pre-empted and re-submit #100
Comments
Alternatively, it would be helpful for the scheduler to stop submitting new tasks and wait for all in-flight tasks to finish before exiting.
Does using the catch() task work for this case? You can also catch errors using the catch() task; this can be done to implement more dynamic retry or recovery workflows.
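For reference, a minimal sketch of wrapping a task with catch(), assuming catch is importable from the top-level redun package (it is defined in redun.scheduler; adjust the import for your redun version). The task and recovery names are placeholders.

```python
# Minimal sketch of recovering from a failed task with redun's catch().
# Import path is assumed; task names are placeholders for illustration.
from redun import task, catch

redun_namespace = "example"


@task()
def flaky_step(x: int) -> int:
    # Stand-in for work running on a spot instance that may be interrupted.
    return x * 2


@task()
def recover(error: Exception) -> int:
    # Invoked if flaky_step raises: return a fallback value or re-issue the
    # work instead of halting the whole workflow.
    return -1


@task()
def main(x: int = 10) -> int:
    # catch(expr, ExceptionClass, recover_task): if evaluating expr raises
    # ExceptionClass, the error is passed to recover_task.
    return catch(flaky_step(x), Exception, recover)
```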
For behavior like this, we have something called catch_all(). It works for the specific case of evaluating tasks in a list, accumulating errors, and at the end allowing the user to decide what to do (fail, partial retry, etc.). I have thought about whether it is possible to define a different mode for error propagation in general. The current mode is eager raising, where one task failing causes all sibling active jobs to be abandoned, halting the workflow. One could imagine an opt-in mode that allows sibling tasks to finish as much as possible before terminating the workflow. If you have ideas on syntax, or examples from other workflow engines, I would be interested to hear them.
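As a rough illustration of the accumulate-then-decide behavior described above: both the catch_all import path and its signature shown here (a list of expressions, an exception class, and a recovery task receiving the mixed results and errors) are assumptions made for illustration, not the confirmed API; consult the redun source for the real interface.

```python
# Rough sketch: accumulate errors across a list of tasks, decide at the end.
# The catch_all import location and signature below are assumptions.
from typing import List, Union

from redun import task
from redun.scheduler import catch_all  # assumed location

redun_namespace = "example"


@task()
def process(item: int) -> int:
    return item + 1


@task()
def decide(results: List[Union[int, Exception]]) -> List[int]:
    # All sibling tasks have run to completion; errors were accumulated
    # rather than eagerly raised. Fail, partially retry, or accept here.
    errors = [r for r in results if isinstance(r, Exception)]
    if errors:
        raise errors[0]
    return results


@task()
def main(items: List[int] = [1, 2, 3]) -> List[int]:
    # Assumed signature: catch_all(expressions, exception_class, recover_task).
    return catch_all([process(i) for i in items], Exception, decide)
```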
Regarding catch(): in cases where there was an unrecoverable error, such as the host being terminated, wrapping every task in catch logic is awkward compared to the scheduler simply re-submitting the work. Regarding the alternative I mentioned, one example is GNU parallel, which can let in-flight jobs finish before exiting after a failure. I would suggest implementing both behaviors as options in the scheduler or executor configuration.
We typically run Batch workflows on AWS spot instances to take advantage of cost savings when possible.
However, when a redun task is interrupted because its host is terminated, the scheduler halts, potentially losing a lot of work.
It would be helpful for redun to detect this case and re-submit the task without halting, up to the configured maximum number of re-submits.
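Until such detection is built into redun, one possible stopgap at the AWS Batch level is a job-definition retry strategy that retries only when the job failed because its spot host was reclaimed. A sketch with boto3, where the job name, image, command, and resource values are placeholders:

```python
# Possible stopgap independent of redun: have AWS Batch itself retry jobs
# whose spot host was reclaimed. Names and resource values are placeholders.
import boto3

batch = boto3.client("batch")

batch.register_job_definition(
    jobDefinitionName="redun-spot-job",  # placeholder
    type="container",
    containerProperties={
        "image": "123456789012.dkr.ecr.us-west-2.amazonaws.com/redun:latest",
        "command": ["redun", "run", "workflow.py", "main"],
        "resourceRequirements": [
            {"type": "VCPU", "value": "2"},
            {"type": "MEMORY", "value": "4096"},
        ],
    },
    retryStrategy={
        "attempts": 3,
        "evaluateOnExit": [
            # Batch sets a status reason like "Host EC2 (instance ...) terminated"
            # when a spot instance is reclaimed; retry only in that case.
            {"onStatusReason": "Host EC2*", "action": "RETRY"},
            # Any other failure exits without further attempts.
            {"onReason": "*", "action": "EXIT"},
        ],
    },
)
```

This only papers over pre-emption at the Batch layer; scheduler-level detection and re-submission in redun itself, as requested here, would still be the more general fix.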