Saving checkpoints at interrupt #3

Open
Ridhamz-nd opened this issue Apr 8, 2024 · 5 comments

Comments

@Ridhamz-nd

Thank you for providing example implementations!

I was wondering what signal is sent to the Docker container when spot training jobs are interrupted. Is it SIGKILL, or SIGTERM with some grace period (https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_StopTrainingJob.html)?

I was looking to implement a signal handler which, on SIGTERM, saves the latest checkpoint to S3. That way, resume happens from the exact point in time.
Is this possible? Do we need to account for the time it takes for the uploader service to upload the content of /opt/ml/checkpoints to the checkpoint_s3_uri?

Any guidelines on how to resume from the latest stop point are much appreciated.
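
For readers following along, here is a minimal sketch of the kind of SIGTERM handler described above. It is not taken from the SageMaker examples: `run_step` and `save_checkpoint` are hypothetical callables, and the sketch assumes the container's main process is the Python training script, so that it is the process receiving the signal. The handler only sets a flag; the actual save happens in the training loop between steps.

```python
import signal
import threading

# Set when SIGTERM arrives; the training loop polls it between steps.
stop_requested = threading.Event()


def _on_sigterm(signum, frame):
    # Keep the handler tiny: only record that a stop was requested.
    stop_requested.set()


# Must be called from the main thread of the main interpreter.
signal.signal(signal.SIGTERM, _on_sigterm)


def train(num_steps, run_step, save_checkpoint):
    """Run up to num_steps steps and return how many were completed.

    run_step(step) and save_checkpoint(completed_steps) are hypothetical
    callables supplied by the caller. If SIGTERM was received, a checkpoint
    is written after the current step and the loop stops early.
    """
    for step in range(num_steps):
        run_step(step)
        if stop_requested.is_set():
            save_checkpoint(step + 1)
            return step + 1
    save_checkpoint(num_steps)
    return num_steps
```

Doing the save in the loop rather than inside the handler avoids writing a checkpoint while a step is half finished.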

@Ridhamz-nd
Author

@eitansela please lmk if this is not the correct location to ask this question and I can close this issue :)

@rst0git

rst0git commented Apr 10, 2024

> I was looking to implement a signal handler which, on SIGTERM, saves the latest checkpoint to S3. That way, resume happens from the exact point in time. Is this possible?

@Ridhamz-nd What would happen if SIGKILL is used instead? You would also need to make sure that a checkpoint is created only when it is necessary, not every time SIGTERM is received, as this may introduce significant performance overhead.

@Ridhamz-nd
Author

@rst0git I don't think a signal handler can be attached to a SIGKILL signal (https://man7.org/linux/man-pages/man7/signal.7.html). Once a SIGKILL is sent, the process is terminated immediately. Based on the SageMaker docs, SIGTERM is sent only once, with a grace period of 120s.
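
As a quick illustration of the SIGKILL point, the sketch below tries to install a handler for a given signal and reports whether the OS allowed it. It assumes a Linux (POSIX) environment; `can_install_handler` is a hypothetical helper name.

```python
import signal


def can_install_handler(signum):
    """Return True if a Python-level handler can be installed for signum."""
    try:
        previous = signal.signal(signum, lambda s, f: None)
    except (OSError, ValueError):
        # The OS refuses handlers for SIGKILL (and SIGSTOP).
        return False
    if previous is not None:
        # Restore whatever was installed before the probe.
        signal.signal(signum, previous)
    return True


print("SIGTERM catchable:", can_install_handler(signal.SIGTERM))
print("SIGKILL catchable:", can_install_handler(signal.SIGKILL))
```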

@eitansela
Contributor

You should save a checkpoint to /opt/ml/checkpoints after each epoch, and SageMaker takes care of copying it to checkpoint_s3_uri for you. Speed is not the issue: if it is a long training job of a few hours or a few days, how would a SIGTERM help here? If a Spot instance goes down, you lose a few minutes of training and resume from the last checkpoint once you have a new Spot instance.
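
A minimal, framework-agnostic sketch of the per-epoch pattern described above. The `epoch_<n>.json` naming scheme, the JSON payload, and both function names are assumptions made for illustration; a real job would typically write framework-specific checkpoint files instead.

```python
import json
import os
import re

CHECKPOINT_DIR = "/opt/ml/checkpoints"  # local path named in this thread
_PATTERN = re.compile(r"^epoch_(\d+)\.json$")  # hypothetical naming scheme


def save_checkpoint(epoch, state, directory=CHECKPOINT_DIR):
    """Write the checkpoint for one epoch and return its path."""
    os.makedirs(directory, exist_ok=True)
    final_path = os.path.join(directory, f"epoch_{epoch}.json")
    tmp_path = final_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"epoch": epoch, "state": state}, f)
    # Rename into place so a reader never sees a half-written final file.
    os.replace(tmp_path, final_path)
    return final_path


def load_latest_checkpoint(directory=CHECKPOINT_DIR):
    """Return the checkpoint with the highest epoch number, or None."""
    if not os.path.isdir(directory):
        return None
    epochs = []
    for name in os.listdir(directory):
        match = _PATTERN.match(name)
        if match:
            epochs.append(int(match.group(1)))
    if not epochs:
        return None
    with open(os.path.join(directory, f"epoch_{max(epochs)}.json")) as f:
        return json.load(f)
```

At startup, the training script would call `load_latest_checkpoint()` and continue from the epoch after the one it returns, or start from scratch when it returns `None`.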

@Ridhamz-nd
Author

You are right that I will only lose a few minutes of training if I'm training on one node.
However, if I am doing multi-node training on p4d/p5 instances, which have a > 20% interruption rate in most regions, then the whole job needs to be paused whenever one node is reclaimed.
In this case, there can be too many interrupts.

Also, in general, it's preferable not to lose training progress that a job may have made.
Currently we try to counter the lost-progress issue by checkpointing frequently, but that also has a cost (especially for large models), so it's much more convenient if we get some sort of signal that tells us our job is going to be interrupted. If SIGTERM is that signal, as the docs suggest, then we can save and resume from the same point.
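
To make the timing question concrete, here is a rough sketch of a budget check a job could run after SIGTERM. It assumes the 120s grace period quoted earlier in this thread, and it assumes the job has measured how long its last checkpoint save took. `GraceBudget` and `upload_margin_seconds` are hypothetical names; how long the upload to checkpoint_s3_uri actually takes is not known here and has to be estimated by the user.

```python
import time

GRACE_SECONDS = 120.0  # grace period quoted in this thread; an assumption


class GraceBudget:
    """Tracks how much of the grace period is left after SIGTERM."""

    def __init__(self, grace_seconds=GRACE_SECONDS, clock=time.monotonic):
        self._grace = grace_seconds
        self._clock = clock
        self._sigterm_at = None
        self.last_save_seconds = 0.0

    def mark_sigterm(self):
        # Record only the first SIGTERM.
        if self._sigterm_at is None:
            self._sigterm_at = self._clock()

    def record_save(self, seconds):
        # Duration of the most recent checkpoint save, measured by the job.
        self.last_save_seconds = seconds

    def remaining(self):
        if self._sigterm_at is None:
            return self._grace
        elapsed = self._clock() - self._sigterm_at
        return max(0.0, self._grace - elapsed)

    def can_checkpoint(self, upload_margin_seconds=0.0):
        # True if a save of the usual length, plus a user-chosen margin
        # for the upload, still fits in the remaining grace period.
        needed = self.last_save_seconds + upload_margin_seconds
        return needed <= self.remaining()
```

If `can_checkpoint()` returns `False`, the job would skip the final save and rely on the most recent periodic checkpoint instead.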

3 participants