feature: python module support to torch_distributed #4324
Have we tested this end to end with both the toolkit changes and this change? Let's make sure the toolkit changes are released first.
Hi @rahul003, I was able to confirm the fixes work. You're right, we need to release the toolkit changes first. Thanks!
/Bot run all
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository
ab8dd53 to 1c2910e
Is anyone available to help with this, please?
Issue #, if available:
Related to aws/sagemaker-training-toolkit#205
Description of changes:
torch_distributed uses torchrun, which supports running Python modules via -m <MODULE>. Currently, SageMaker limits torch_distributed to scripts only. This change allows submitting training jobs with sagemaker_program="-m <MODULE>".
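To illustrate the difference between script mode and module mode, here is a minimal sketch of how a program string such as "-m <MODULE>" could be turned into a torchrun command line. The helper name build_torchrun_command and the module name my_pkg.train are hypothetical and are not taken from this PR or from sagemaker-training-toolkit; the only real interface assumed is torchrun's own -m and --nproc_per_node flags.

```python
import shlex
from typing import List, Sequence


def build_torchrun_command(
    program: str,
    nproc_per_node: int,
    user_args: Sequence[str] = (),
) -> List[str]:
    """Build a torchrun argv list for either a script or a module.

    Hypothetical helper for illustration only; not the toolkit's code.
    `program` is either a script path ("train.py") or "-m <MODULE>".
    """
    tokens = shlex.split(program)
    if not tokens:
        raise ValueError("program must not be empty")

    cmd = ["torchrun", "--nproc_per_node", str(nproc_per_node)]

    if tokens[0] == "-m":
        # Module mode: torchrun runs the module like `python -m <MODULE>`.
        if len(tokens) != 2:
            raise ValueError("expected exactly one module name after -m")
        cmd += ["-m", tokens[1]]
    else:
        # Script mode: pass the script path through unchanged.
        cmd.append(program)

    cmd += list(user_args)
    return cmd


if __name__ == "__main__":
    # Module mode, as this PR would allow.
    print(build_torchrun_command("-m my_pkg.train", 8, ["--epochs", "1"]))
    # Script mode, as already supported.
    print(build_torchrun_command("train.py", 8, ["--epochs", "1"]))
```

Returning a list rather than a single string avoids shell quoting issues if the result is later passed to subprocess.run.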
Testing done:
Successfully submitted the SageMaker training job with the right parameters.
The training job will still fail at runtime because sagemaker-training-toolkit needs to be updated:
Merge Checklist
Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.
General
Tests
unique_name_from_base to create resource names in integ tests (if appropriate)
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.