This repository was archived by the owner on Mar 21, 2024. It is now read-only.

SSL is not picking up recovery checkpoints #608

Closed
ant0nsc opened this issue Dec 7, 2021 · 1 comment

Comments

@ant0nsc
Contributor

ant0nsc commented Dec 7, 2021

When SSL jobs get pre-empted, they seem to start afresh instead of resuming from the recovery checkpoints.

As a first step, add diagnostics: print out all checkpoints that are found, to see whether a recovered job actually sees the previously written checkpoints.
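
A minimal sketch of what such a diagnostic could look like, not the code that was actually added to the repository. The function name `log_available_checkpoints`, the `*.ckpt` file pattern, and the idea of scanning one folder recursively are all assumptions made for illustration.

```python
import logging
from pathlib import Path
from typing import List


def log_available_checkpoints(checkpoint_folder: Path, pattern: str = "*.ckpt") -> List[Path]:
    """Log every checkpoint file found below `checkpoint_folder`.

    Hypothetical helper: the folder layout and the "*.ckpt" pattern are
    assumptions, not taken from the issue.
    Returns the checkpoint paths that were found, sorted by path.
    """
    if not checkpoint_folder.is_dir():
        logging.warning("Checkpoint folder %s does not exist", checkpoint_folder)
        return []
    # Search recursively, so checkpoints in sub-folders are reported too.
    checkpoints = sorted(checkpoint_folder.rglob(pattern))
    logging.info("Found %d checkpoint file(s) in %s", len(checkpoints), checkpoint_folder)
    for path in checkpoints:
        stat = path.stat()
        # Size and modification time help tell stale files from fresh ones.
        logging.info("  %s (%d bytes, mtime %s)", path, stat.st_size, stat.st_mtime)
    return checkpoints
```

Calling this once at job start, before any checkpoint is loaded, would show in the logs whether a restarted job can see the files written by the pre-empted run.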

AB#4774

@ant0nsc
Contributor Author

ant0nsc commented Dec 14, 2021

Bug in AML restart functionality reported. Added #614 as a workaround.

ant0nsc closed this as completed Dec 14, 2021