Saving checkpoints at interrupt #3
Comments
@eitansela please let me know if this is not the correct location to ask this question, and I can close this issue :)
@Ridhamz-nd What would happen if SIGKILL is used instead? You would also need to make sure that a checkpoint is created only when necessary, not every time SIGTERM is received, as this may introduce significant performance overhead.
@rst0git I don't think a signal handler can be attached to a SIGKILL signal (https://man7.org/linux/man-pages/man7/signal.7.html). Once SIGKILL is sent, the process is terminated immediately. Based on the SageMaker docs, SIGTERM is only sent once, with a grace period of 120s.
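For illustration, here is a minimal sketch of how such a SIGTERM handler could be wired into a training script. It assumes a PyTorch-style loop; the model, optimizer, dataloader, and checkpoint naming are placeholders, not anything from this repo. The handler only records that a stop was requested, and the loop writes the checkpoint to /opt/ml/checkpoints at a safe point so the write finishes well within the 120s grace period.

```python
import os
import signal

import torch

# Default local directory that SageMaker syncs to checkpoint_s3_uri.
CHECKPOINT_DIR = "/opt/ml/checkpoints"

stop_requested = False  # set by the SIGTERM handler, checked by the training loop


def handle_sigterm(signum, frame):
    # SIGKILL cannot be caught, but SIGTERM can. Only record the request here;
    # the actual checkpoint write happens at a safe point in the training loop.
    global stop_requested
    stop_requested = True


signal.signal(signal.SIGTERM, handle_sigterm)


def save_checkpoint(model, optimizer, epoch):
    os.makedirs(CHECKPOINT_DIR, exist_ok=True)
    torch.save(
        {
            "epoch": epoch,
            "model_state": model.state_dict(),
            "optimizer_state": optimizer.state_dict(),
        },
        os.path.join(CHECKPOINT_DIR, f"checkpoint-{epoch}.pt"),
    )


def train(model, optimizer, dataloader, num_epochs):
    for epoch in range(num_epochs):
        for batch in dataloader:
            ...  # forward/backward/step for one batch (placeholder)
            if stop_requested:
                # Save once and exit, keeping well within the grace period.
                save_checkpoint(model, optimizer, epoch)
                raise SystemExit(0)
        save_checkpoint(model, optimizer, epoch)  # regular end-of-epoch checkpoint
```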
You should save the checkpoint to /opt/ml/checkpoints.
So you are right in that I would only lose a few minutes of training if I'm training on one node. Also, in general, it's preferable not to lose any training progress the job has made.
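On the resume side, a sketch of how the training script could pick up the most recent checkpoint at startup is below. It assumes the checkpoint-{epoch}.pt naming from the handler sketch above; when checkpoint_s3_uri is configured, SageMaker restores the S3 contents into /opt/ml/checkpoints before the restarted job begins.

```python
import glob
import os

import torch

CHECKPOINT_DIR = "/opt/ml/checkpoints"


def load_latest_checkpoint(model, optimizer):
    """Restore the most recent checkpoint, if any, and return the epoch to resume from."""
    paths = glob.glob(os.path.join(CHECKPOINT_DIR, "checkpoint-*.pt"))
    if not paths:
        return 0  # nothing to resume from; start fresh
    latest = max(paths, key=os.path.getmtime)  # newest file wins
    state = torch.load(latest, map_location="cpu")
    model.load_state_dict(state["model_state"])
    optimizer.load_state_dict(state["optimizer_state"])
    return state["epoch"] + 1  # resume from the next epoch
```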
Thank you for providing example implementations!
I was wondering what signal is sent to the Docker container when spot training jobs are interrupted. Is it SIGKILL, or SIGTERM with some grace period (https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_StopTrainingJob.html)?
I was looking to implement a signal handler which, on SIGTERM, saves the latest checkpoint to S3, so that training resumes from that exact point in time.
Is this possible? Do we need to account for the time it takes for the uploader service to upload the contents of /opt/ml/checkpoints to the checkpoint_s3_uri? Any guidelines on how to resume from the latest stop point would be much appreciated.
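For context, this is roughly how checkpointing is configured on the estimator side for managed spot training. It is a minimal sketch using the SageMaker Python SDK; the role ARN, bucket names, script name, instance type, and framework versions are placeholder assumptions. SageMaker syncs checkpoint_local_path (default /opt/ml/checkpoints) to checkpoint_s3_uri while the job runs and copies it back when an interrupted job restarts.

```python
from sagemaker.pytorch import PyTorch

# Placeholder values: role, bucket, script name, and versions are assumptions.
estimator = PyTorch(
    entry_point="train.py",          # training script containing the SIGTERM handler
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_type="ml.p3.2xlarge",
    instance_count=1,
    framework_version="1.13",
    py_version="py39",
    use_spot_instances=True,         # managed spot training
    max_run=3600,                    # max training time in seconds
    max_wait=7200,                   # max time including waiting for spot capacity
    checkpoint_s3_uri="s3://my-bucket/checkpoints/",  # S3 location checkpoints are synced to
    checkpoint_local_path="/opt/ml/checkpoints",      # local directory that is synced (default)
)

estimator.fit({"training": "s3://my-bucket/training-data/"})
```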