SageMaker @remote function: Added multi-node functionality #4984

brunopistone · 2025-01-06T15:39:38Z

Issue #, if available:

Description of changes: Added support for running distributed SageMaker Training jobs across multiple nodes (multiple instances) by loading the distributed environment variables.

Testing done: Ran the unit tests for remote_function and added 2 unit tests covering the single-node and multi-node cases.
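To make "loading distributed environment variables" concrete, here is a minimal sketch, not the code from this PR. It assumes a resource config with `hosts` and `current_host` keys, as in SageMaker's resourceconfig.json. The helper names `build_distributed_env` and `torchrun_args`, and the environment variable names, are hypothetical and chosen for illustration; the variable names the PR actually exports are not shown in this thread.

```python
from typing import Dict, List


def build_distributed_env(
    resource_config: Dict, nproc_per_node: int, master_port: int = 29500
) -> Dict[str, str]:
    """Derive rendezvous settings for one node from a SageMaker-style resource config.

    Hypothetical helper: the variable names below are illustrative only.
    """
    # Sort so that every node computes the same ordering and agrees on rank 0.
    hosts: List[str] = sorted(resource_config["hosts"])
    current_host = resource_config["current_host"]
    return {
        "MASTER_ADDR": hosts[0],
        "MASTER_PORT": str(master_port),
        "NNODES": str(len(hosts)),
        "NODE_RANK": str(hosts.index(current_host)),
        "NPROC_PER_NODE": str(nproc_per_node),
    }


def torchrun_args(env: Dict[str, str]) -> List[str]:
    """Translate the settings into torchrun command-line options."""
    return [
        "torchrun",
        f"--nnodes={env['NNODES']}",
        f"--node_rank={env['NODE_RANK']}",
        f"--nproc_per_node={env['NPROC_PER_NODE']}",
        f"--master_addr={env['MASTER_ADDR']}",
        f"--master_port={env['MASTER_PORT']}",
    ]


# Example: the second of two instances, each running 8 worker processes.
config = {"current_host": "algo-2", "hosts": ["algo-1", "algo-2"]}
env = build_distributed_env(config, nproc_per_node=8)
args = torchrun_args(env)
```

Every node runs the same derivation and only `NODE_RANK` differs between them, which is why a single-node job falls out as the special case of one host with rank 0.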

Merge Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.

General

I have read the CONTRIBUTING doc
I certify that the changes I am introducing will be backward compatible, and I have discussed concerns about this, if any, with the Python SDK team
I used the commit message format described in CONTRIBUTING
I have passed the region in to all S3 and STS clients that I've initialized as part of this change.
I have updated any necessary documentation, including READMEs and API docs (if appropriate)

Tests

I have added tests that prove my fix is effective or that my feature works (if appropriate)
I have added unit and/or integration tests as appropriate to ensure backward compatibility of the changes
I have checked that my tests are not configured for a specific region or account (if appropriate)
I have used unique_name_from_base to create resource names in integ tests (if appropriate)
If adding any dependency in requirements.txt files, I have spell checked and ensured they exist in PyPi

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

nargokul · 2025-01-16T19:08:30Z

src/sagemaker/remote_function/client.py

@@ -91,7 +91,7 @@ def remote(
    use_spot_instances=False,
    max_wait_time_in_seconds=None,
    use_torchrun=False,
-    nproc_per_node=1,
+    nproc_per_node: Optional[int] = None,


What does nproc stand for? Can we use the unabbreviated name for the parameter?
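For context on the question: the name presumably mirrors torchrun's `--nproc_per_node` option, the number of worker processes to launch on each node. Changing the default from `1` to `None` lets the runtime tell "not specified" apart from an explicit value. The sketch below shows one way such a default could be resolved; `resolve_nproc_per_node` and its fallback order are assumptions for illustration, since the PR's actual resolution logic is not shown in this thread.

```python
from typing import Optional


def resolve_nproc_per_node(
    requested: Optional[int], num_gpus: int, num_neurons: int = 0
) -> int:
    """Pick the number of processes per node (hypothetical helper).

    An explicit value always wins; otherwise fall back to the detected
    accelerator count, and finally to a single process on CPU-only hosts.
    """
    if requested is not None:
        if requested < 1:
            raise ValueError("nproc_per_node must be a positive integer")
        return requested
    if num_gpus > 0:
        return num_gpus
    if num_neurons > 0:
        return num_neurons
    return 1
```

With this shape, callers who previously relied on the old default of `1` on a CPU instance still get one process, which is one way to read the "backwards compatibility" commits in this PR.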

nargokul · 2025-01-16T19:12:46Z

src/sagemaker/remote_function/runtime_environment/bootstrap_runtime_environment.py

+    except OSError:
+        logger.info("No Neurons detected (normal if no neurons installed)")
+        return 0
+    except subprocess.CalledProcessError as e:
+        if e.output is not None:
+            try:
+                msg = e.output.decode("utf-8").partition("error=")[2]
+                logger.info(
+                    "No Neurons detected (normal if no neurons installed). \
+                    If neuron installed then %s",
+                    msg,
+                )
+            except AttributeError:
+                logger.info("No Neurons detected (normal if no neurons installed)")
+        else:
+            logger.info("No Neurons detected (normal if no neurons installed)")
+
+        return 0


Can we add unit tests for these?
Or for this file in general, in a separate test file?

Added 3 unit tests for the environment bootstrap:

single instance with CPU

single instance with multiple GPUs

multiple instances with multiple GPUs
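The error paths in the hunk above could also be covered directly. The following is a sketch only: `count_neuron_cores` is a hypothetical stand-in for the function under review (its real name and body are not shown in this thread), and the `neuron-ls -j` call and the `nc_count` field are assumptions about how the cores are counted. The tests patch `subprocess.check_output` so that no Neuron tooling is required.

```python
import json
import subprocess
from unittest import mock


def count_neuron_cores() -> int:
    """Hypothetical stand-in: return 0 whenever detection fails."""
    try:
        output = subprocess.check_output(["neuron-ls", "-j"], stderr=subprocess.STDOUT)
        devices = json.loads(output.decode("utf-8"))
        return sum(device["nc_count"] for device in devices)
    except OSError:
        # Binary not installed: normal on instances without Neuron devices.
        return 0
    except subprocess.CalledProcessError:
        # Binary present but the command failed.
        return 0


def test_missing_binary_returns_zero():
    with mock.patch("subprocess.check_output", side_effect=OSError("not found")):
        assert count_neuron_cores() == 0


def test_failed_command_returns_zero():
    error = subprocess.CalledProcessError(1, "neuron-ls", output=b"error=no device")
    with mock.patch("subprocess.check_output", side_effect=error):
        assert count_neuron_cores() == 0


def test_counts_cores_across_devices():
    payload = json.dumps([{"nc_count": 2}, {"nc_count": 2}]).encode("utf-8")
    with mock.patch("subprocess.check_output", return_value=payload):
        assert count_neuron_cores() == 4
```

Patching at the `subprocess` module level keeps each test independent of the host, so the same file can run on CPU-only CI workers.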

brunopistone added 2 commits January 4, 2025 17:01

implemented multi-node distribution with @Remote function

841af92

completed unit tests

4fe2747

brunopistone requested a review from a team as a code owner January 6, 2025 15:39

brunopistone requested a review from chad119 January 6, 2025 15:39

brunopistone had a problem deploying to manual-approval January 6, 2025 15:39 — with GitHub Actions Error

added distributed training with CPU and torchrun

fa79639

brunopistone had a problem deploying to manual-approval January 8, 2025 23:27 — with GitHub Actions Error

Merge branch 'master' into master

43547b0

brunopistone had a problem deploying to manual-approval January 13, 2025 18:14 — with GitHub Actions Error

brunopistone added 2 commits January 14, 2025 21:51

backwards compatibility nproc_per_node

06ab509

Merge branch 'master' of https://github.com/brunopistone/sagemaker-py…

3a03c4b

…thon-sdk

brunopistone temporarily deployed to auto-approve January 14, 2025 21:53 — with GitHub Actions Inactive

Merge branch 'master' into master

bc5918a

brunopistone temporarily deployed to auto-approve January 14, 2025 23:06 — with GitHub Actions Inactive

fixing code: permissions for non-root users, integration tests

7d54096

brunopistone temporarily deployed to auto-approve January 15, 2025 11:40 — with GitHub Actions Inactive

fixed docstyle

423c585

brunopistone temporarily deployed to auto-approve January 15, 2025 14:23 — with GitHub Actions Inactive

refactor nproc_per_node for backwards compatibility

adcc38e

brunopistone temporarily deployed to auto-approve January 15, 2025 18:56 — with GitHub Actions Inactive

refactor nproc_per_node for backwards compatibility

00eb637

brunopistone temporarily deployed to auto-approve January 15, 2025 21:58 — with GitHub Actions Inactive

pylint fix, newlines

0dea502

brunopistone temporarily deployed to auto-approve January 16, 2025 07:37 — with GitHub Actions Inactive

nargokul reviewed Jan 16, 2025

added unit tests for bootstrap_environment remote

b152915

brunopistone temporarily deployed to auto-approve January 16, 2025 21:31 — with GitHub Actions Inactive

nargokul approved these changes Jan 16, 2025

nargokul merged commit ae3cc1c into aws:master Jan 16, 2025
13 of 14 checks passed

brunopistone mentioned this pull request Jan 21, 2025

mpirun protocol - distributed training with @remote decorator #4998

Merged

benieric mentioned this pull request Jan 21, 2025

fix: Add missing attributes to local resourceconfig #4999

Merged

10 tasks
