Feature request: detect when spot instances are pre-empted and re-submit #100

Open
aksarkar opened this issue Jul 22, 2024 · 3 comments

@aksarkar

We typically run Batch workflows on AWS spot instances to take advantage of cost savings when possible.

However, when a redun task is interrupted because its host is terminated, the scheduler halts, potentially losing a lot of work.

It would be helpful for redun to detect this case and re-submit the task without halting, up to the configured maximum number of re-submits.

@aksarkar
Author

Alternatively, it would be helpful for the scheduler to stop submitting new tasks and wait for all in-flight tasks to finish before exiting.

@mattrasmus
Collaborator

Does using the retries option work for your use case? https://insitro.github.io/redun/config.html#retries

You can also catch errors using the catch() task. This can be done to implement more dynamic retry or recovery workflows.

> Alternatively, it would be helpful for the scheduler to stop submitting new tasks and wait for all in-flight tasks to finish before exiting.

For behavior like this, we have something called catch_all(). It works for the specific case of evaluating tasks in a list, accumulating errors, and at the end allowing the user to decide what to do (fail, partial retry, etc).
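The accumulate-then-decide pattern described above can be sketched in plain Python. This is only an illustration of the idea, not redun's catch_all() implementation or signature; `run_all_collect_errors` and the use of zero-argument callables are assumptions made for the example.

```python
from typing import Any, Callable, List, Sequence, Tuple


def run_all_collect_errors(
    thunks: Sequence[Callable[[], Any]],
) -> Tuple[List[Tuple[int, Any]], List[Tuple[int, Exception]]]:
    """Evaluate every callable, collecting failures instead of raising eagerly.

    Hypothetical helper: returns (index, value) pairs for successes and
    (index, exception) pairs for failures, so the caller decides what to do.
    """
    results: List[Tuple[int, Any]] = []
    errors: List[Tuple[int, Exception]] = []
    for index, thunk in enumerate(thunks):
        try:
            results.append((index, thunk()))
        except Exception as exc:  # keep going so sibling work is not lost
            errors.append((index, exc))
    return results, errors


if __name__ == "__main__":
    ok, failed = run_all_collect_errors([lambda: 1, lambda: 1 / 0, lambda: 3])
    # The caller can now fail, retry only the failed indices, or accept partial results.
    print("succeeded:", ok)
    print("failed indices:", [index for index, _ in failed])
```

Returning the failed indices alongside the exceptions is what makes a partial retry possible, since the caller knows exactly which items to re-run.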

I have thought about whether it's possible to define a different mode for error propagation in general. The current mode is eager raising, where one task failing causes all sibling active jobs to be abandoned, which causes the workflow to halt. One could imagine an opt-in mode that allows sibling tasks to finish as much as possible before terminating the workflow. If you have ideas on syntax or examples from other workflow engines, I would be interested to hear them.

@aksarkar
Author

Regarding retries: I am unsure that this will do what I want, which is to re-submit a job only when the last line of the CloudWatch logs indicates that the instance was terminated.

In cases where there was an unrecoverable error (MemoryError, AssertionError, etc.) I do want the behavior where the workflow halts (eventually).
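To make the desired behavior concrete, here is a minimal sketch in plain Python. It does not use redun or AWS APIs: `JobFailed`, `is_preemption`, `run_with_preemption_retries`, and the `PREEMPTION_MARKER` string are all hypothetical, and the actual text that signals a terminated instance would need to be confirmed against real logs.

```python
from typing import Any, Callable, List

# Assumption: placeholder text, not a real CloudWatch or AWS Batch message.
PREEMPTION_MARKER = "instance terminated"


class JobFailed(Exception):
    """Hypothetical error carrying the log lines of the failed job."""

    def __init__(self, message: str, log_lines: List[str]) -> None:
        super().__init__(message)
        self.log_lines = log_lines


def is_preemption(log_lines: List[str]) -> bool:
    """Return True only if the last log line looks like a host termination."""
    return bool(log_lines) and PREEMPTION_MARKER in log_lines[-1].lower()


def run_with_preemption_retries(submit: Callable[[], Any], max_resubmits: int) -> Any:
    """Re-submit on pre-emption up to max_resubmits times; re-raise anything else."""
    resubmits = 0
    while True:
        try:
            return submit()
        except JobFailed as err:
            # Unrecoverable errors and exhausted retries propagate to the caller.
            if not is_preemption(err.log_lines) or resubmits >= max_resubmits:
                raise
            resubmits += 1
```

The key point is that the retry decision depends on why the job failed, so a MemoryError or AssertionError still halts the workflow while a pre-empted host does not.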

When using catch, am I able to easily get the Cloudwatch logs for the failing job?

Regarding the alternative I mentioned, one example is GNU parallel --halt soon,fail=1. The behavior in redun is analogous to --halt now,fail=1.
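As a rough illustration of the "soon" semantics (stop starting new work after the first failure, but let running work finish), here is a sketch using Python's concurrent.futures. It is not redun scheduler code; `run_halt_soon` and its return shape are assumptions made for the example.

```python
import concurrent.futures as cf
from typing import Any, Callable, Iterable, List, Tuple


def run_halt_soon(
    tasks: Iterable[Callable[[], Any]], max_workers: int = 2
) -> Tuple[List[Any], List[BaseException], int]:
    """Run tasks; after the first failure, submit nothing new but drain in-flight work.

    Returns (results, errors, number_of_tasks_submitted).
    """
    queue = iter(tasks)
    in_flight: set = set()
    results: List[Any] = []
    errors: List[BaseException] = []
    submitted = 0
    failed = False
    with cf.ThreadPoolExecutor(max_workers=max_workers) as pool:
        while True:
            # Top up the pool only while no failure has been seen.
            while not failed and len(in_flight) < max_workers:
                task = next(queue, None)
                if task is None:
                    break
                in_flight.add(pool.submit(task))
                submitted += 1
            if not in_flight:
                break
            done, in_flight = cf.wait(in_flight, return_when=cf.FIRST_COMPLETED)
            for future in done:
                exc = future.exception()
                if exc is None:
                    results.append(future.result())
                else:
                    errors.append(exc)
                    failed = True
    return results, errors, submitted
```

A "now" mode would instead raise as soon as the first error is observed and abandon whatever is still running, which is the behavior described for redun in this thread.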

I would suggest implementing it as options in the redun.ini scheduler section.
