Feature request: detect when spot instances are pre-empted and re-submit #100

aksarkar · 2024-07-22T20:40:41Z

We typically run Batch workflows on AWS spot instances to take advantage of cost savings when possible.

However, when a redun task is interrupted because its host is terminated, the scheduler halts, potentially losing a lot of work.

It would be helpful for redun to detect this case and re-submit the task without halting, up to the configured maximum number of re-submits.

aksarkar · 2024-07-23T15:42:08Z

Alternatively, it would be helpful for the scheduler to stop submitting new tasks and wait for all remaining in-flight tasks to finish before exiting.

mattrasmus · 2024-07-23T22:40:17Z

Does using the retries option work for your use case? https://insitro.github.io/redun/config.html#retries

You can also catch errors using the catch() task. This can be done to implement more dynamic retry or recovery workflows.

> Alternatively, it would be helpful for the scheduler to stop submitting new tasks, and waiting for all in flight tasks remaining before exiting.

For behavior like this, we have something called catch_all(). It works for the specific case of evaluating tasks in a list, accumulating errors, and at the end allowing the user to decide what to do (fail, partial retry, etc).
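The accumulate-then-decide pattern described above can be sketched in plain Python. This is not redun's catch_all() implementation or signature; the function name `run_all_collect_errors` and the zero-argument callables are hypothetical stand-ins used only to illustrate the idea.

```python
from typing import Any, Callable, List, Tuple


def run_all_collect_errors(
    thunks: List[Callable[[], Any]],
) -> Tuple[List[Tuple[int, Any]], List[Tuple[int, Exception]]]:
    """Evaluate every callable, collecting errors instead of raising eagerly.

    Hypothetical helper: returns (index, result) pairs for successes and
    (index, exception) pairs for failures, so the caller can decide what to do.
    """
    results: List[Tuple[int, Any]] = []
    errors: List[Tuple[int, Exception]] = []
    for index, thunk in enumerate(thunks):
        try:
            results.append((index, thunk()))
        except Exception as error:  # accumulate, do not abort the remaining items
            errors.append((index, error))
    return results, errors
```

With the errors in hand, the caller can fail the whole run, retry only the failed indices, or accept partial results.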

I have thought about whether it's possible to define a different mode for error propagation in general. The current mode is eager raising, where one task failing causes all active sibling jobs to be abandoned, causing the workflow to halt. One could imagine an opt-in mode that allows sibling tasks to finish as much as possible before terminating the workflow. If you have ideas on syntax or examples from other workflow engines, I would be interested to hear them.

aksarkar · 2024-07-24T21:06:38Z

Regarding retries: I am not sure this will do what I want, which is to re-submit a job only when the last line of the CloudWatch logs indicates that the instance was terminated.

In cases where there was an unrecoverable error (MemoryError, AssertionError, etc.) I do want the behavior where the workflow halts (eventually).
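A minimal sketch of the selective re-submit policy described here, in plain Python and independent of redun. Everything named below is an assumption for illustration: the `JobFailed` exception, the `submit` callable, and especially the `SPOT_MARKER` text, since the actual message written to the logs depends on the executor.

```python
from typing import Any, Callable, List

# Assumption: illustrative marker only. The real text in the job log may differ.
SPOT_MARKER = "terminated"


class JobFailed(Exception):
    """Hypothetical error carrying the failed job's log lines."""

    def __init__(self, log_lines: List[str]):
        super().__init__(log_lines[-1] if log_lines else "job failed")
        self.log_lines = log_lines


def is_preempted(log_lines: List[str]) -> bool:
    """Return True only when the last log line mentions the marker."""
    return bool(log_lines) and SPOT_MARKER in log_lines[-1].lower()


def run_with_resubmit(submit: Callable[[], Any], max_resubmits: int) -> Any:
    """Re-submit on pre-emption only, up to max_resubmits times.

    Any other failure (or exhausting the budget) re-raises the original error,
    so unrecoverable errors still halt the caller.
    """
    resubmits = 0
    while True:
        try:
            return submit()
        except JobFailed as error:
            if not is_preempted(error.log_lines) or resubmits >= max_resubmits:
                raise
            resubmits += 1
```

The point of the design is that the retry decision looks at why the job failed, not just that it failed, which a plain retry count cannot express.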

When using catch, am I able to easily get the CloudWatch logs for the failing job?

Regarding the alternative I mentioned, one example is GNU parallel --halt soon,fail=1. The behavior in redun is analogous to --halt now,fail=1.
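To make the "halt soon" behavior concrete, here is a rough sketch using Python's standard concurrent.futures rather than redun's scheduler. The function name `run_halt_soon` and the `max_in_flight` parameter are hypothetical. After the first failure, no new tasks are submitted, but tasks already running are allowed to finish.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from typing import Any, Callable, Dict, List, Tuple


def run_halt_soon(
    tasks: List[Callable[[], Any]], max_in_flight: int = 2
) -> Tuple[Dict[int, Any], Dict[int, BaseException]]:
    """Run tasks with bounded concurrency; stop submitting after a failure.

    Hypothetical helper: returns results and errors keyed by task index.
    Tasks that were never started appear in neither dictionary.
    """
    results: Dict[int, Any] = {}
    errors: Dict[int, BaseException] = {}
    in_flight = {}  # maps each running future to its task index
    next_index = 0
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        while True:
            # Top up the pool only while no failure has been observed.
            while (
                not errors
                and next_index < len(tasks)
                and len(in_flight) < max_in_flight
            ):
                in_flight[pool.submit(tasks[next_index])] = next_index
                next_index += 1
            if not in_flight:
                break  # nothing running and nothing more to submit
            done, _ = wait(in_flight, return_when=FIRST_COMPLETED)
            for future in done:
                index = in_flight.pop(future)
                error = future.exception()
                if error is None:
                    results[index] = future.result()
                else:
                    errors[index] = error
    return results, errors
```

A "halt now" variant would instead abandon the in-flight tasks as soon as the first error is seen, which is the behavior the thread attributes to redun today.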

I would suggest implementing it as options in the redun.ini scheduler section.
