support defining environment variables from configmap keys
dgrove-oss committed Jan 17, 2025
1 parent b272ab3 commit 7ef1e89
Showing 5 changed files with 194 additions and 2 deletions.
2 changes: 1 addition & 1 deletion tools/pytorchjob-generator/chart/README.md
@@ -39,7 +39,7 @@ customize the Jobs generated by the tool.

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| environmentVariables | array | `nil` | List of variables/values to be defined for all the ranks. Values can be literals or references to Kubernetes secrets. See [values.yaml](values.yaml) for examples of supported syntaxes. NOTE: The following standard [PyTorch Distributed environment variables](https://pytorch.org/docs/stable/distributed.html#environment-variable-initialization) are set automatically and can be referenced in the commands without being set manually: WORLD_SIZE, RANK, MASTER_ADDR, MASTER_PORT. |
| environmentVariables | array | `nil` | List of variables/values to be defined for all the ranks. Values can be literals or references to Kubernetes secrets or configmaps. See [values.yaml](values.yaml) for examples of supported syntaxes. NOTE: The following standard [PyTorch Distributed environment variables](https://pytorch.org/docs/stable/distributed.html#environment-variable-initialization) are set automatically and can be referenced in the commands without being set manually: WORLD_SIZE, RANK, MASTER_ADDR, MASTER_PORT. |
| sshGitCloneConfig | object | `nil` | Private GitHub clone support. See [values.yaml](values.yaml) for additional instructions. |
| setupCommands | array | no custom commands are executed | List of custom commands to be run at the beginning of the execution. Use `setupCommands` to clone code, download data, and change directories. |
| mainProgram | string | `nil` | Name of the PyTorch program to be executed by `torchrun`. Please provide your program name here and NOT in "setupCommands" as this helm template provides the necessary "torchrun" arguments for the parallel execution. WARNING: this program is relative to the current path set by change-of-directory commands in "setupCommands". If no value is provided, then only `setupCommands` are executed and torchrun is elided. |
5 changes: 5 additions & 0 deletions tools/pytorchjob-generator/chart/templates/_helpers.tpl
@@ -110,6 +110,11 @@ env:
    secretKeyRef:
      name: {{ required "Missing 'name' in 'environmentVariables.secret' list element" $variable.secret.name }}
      key: {{ required "Missing 'key' in 'environmentVariables.secret' list element" $variable.secret.key | quote }}
{{- else if $variable.configmap }}
  valueFrom:
    configMapKeyRef:
      name: {{ required "Missing 'name' in 'environmentVariables.configmap' list element" $variable.configmap.name }}
      key: {{ required "Missing 'key' in 'environmentVariables.configmap' list element" $variable.configmap.key | quote }}
{{- else if ( kindIs "float64" $variable.value ) }}
  value: "0"
{{- else }}
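
Reviewer note: the new `configMapKeyRef` branch only emits a reference, so the ConfigMap itself must already exist in the Job's namespace when the pods start. A minimal sketch of such a ConfigMap is shown below. The name, namespace, and key are the placeholder values used by the test case in this commit, and the data value is purely an assumption for illustration.

```yaml
# Hypothetical ConfigMap that the generated configMapKeyRef would resolve against.
# All names are placeholders taken from the commit's test case; the value is invented.
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-configmap-name      # matches environmentVariables[].configmap.name
  namespace: my-namespace      # must be the namespace the pods run in
data:
  my-configmap-key: "example-value"   # matches environmentVariables[].configmap.key
```

Because the template does not set `optional: true` on the reference, a missing ConfigMap or key would be expected to keep the container from starting rather than leave the variable unset.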
@@ -1512,3 +1512,167 @@ scheduler can be set:
                        - emptyDir:
                            medium: Memory
                          name: dshm
user-defined environment variables:
  1: |
    apiVersion: workload.codeflare.dev/v1beta2
    kind: AppWrapper
    metadata:
      annotations:
        workload.codeflare.dev.mlbatch/pytorchGeneratorVersion: 1.1.6
      labels:
        kueue.x-k8s.io/queue-name: default-queue
      name: my-job
      namespace: my-namespace
    spec:
      components:
        - template:
            apiVersion: kubeflow.org/v1
            kind: PyTorchJob
            metadata:
              name: my-job
            spec:
              pytorchReplicaSpecs:
                Master:
                  replicas: 1
                  restartPolicy: Never
                  template:
                    spec:
                      affinity:
                        nodeAffinity:
                          requiredDuringSchedulingIgnoredDuringExecution:
                            nodeSelectorTerms:
                              - matchExpressions:
                                  - key: autopilot.ibm.com/gpuhealth
                                    operator: NotIn
                                    values:
                                      - ERR
                                      - TESTING
                                      - EVICT
                      containers:
                        - command:
                            - sh
                            - -c
                            - |
                              echo "Environment variables set by the kubeflow training operator:"
                              echo ${MASTER_ADDR}:${MASTER_PORT}
                              echo "PYTHONUNBUFFERED:"${PYTHONUNBUFFERED}
                              echo My global rank is ${RANK} / ${WORLD_SIZE}
                              echo "Other injected environment variables:"
                              echo "NVME_MOUNT_PATH: "${NVME_MOUNT_PATH}
                              #
                              # User commands
                              #
                              git clone https://github.com/dbarnett/python-helloworld
                              cd python-helloworld
                              echo executing: torchrun --nnodes=${WORLD_SIZE} --node_rank=${RANK} --nproc_per_node=8 --rdzv_id=101 --rdzv_endpoint="${MASTER_ADDR}:${MASTER_PORT}" helloworld.py
                              torchrun --nnodes=${WORLD_SIZE} --node_rank=${RANK} --nproc_per_node=8 --rdzv_id=101 --rdzv_endpoint="${MASTER_ADDR}:${MASTER_PORT}" helloworld.py
                          env:
                            - name: EXAMPLE_VAR1
                              value: "6"
                            - name: EXAMPLE_VAR2
                              value: example2string
                            - name: EXAMPLE_VAR3
                              valueFrom:
                                secretKeyRef:
                                  key: my-secret-key
                                  name: my-secret-name
                            - name: EXAMPLE_VAR4
                              valueFrom:
                                configMapKeyRef:
                                  key: my-configmap-key
                                  name: my-configmap-name
                          image: ghcr.io/foundation-model-stack/base:pytorch-latest-nightly-20230126
                          imagePullPolicy: IfNotPresent
                          name: pytorch
                          resources:
                            limits:
                              cpu: 500m
                              memory: 1Gi
                              nvidia.com/gpu: 8
                              nvidia.com/roce_gdr: 0
                            requests:
                              cpu: 500m
                              memory: 1Gi
                              nvidia.com/gpu: 8
                              nvidia.com/roce_gdr: 0
                          volumeMounts:
                            - mountPath: /dev/shm
                              name: dshm
                      imagePullSecrets: []
                      priorityClassName: default-priority
                      volumes:
                        - emptyDir:
                            medium: Memory
                          name: dshm
                Worker:
                  replicas: 3
                  restartPolicy: Never
                  template:
                    spec:
                      affinity:
                        nodeAffinity:
                          requiredDuringSchedulingIgnoredDuringExecution:
                            nodeSelectorTerms:
                              - matchExpressions:
                                  - key: autopilot.ibm.com/gpuhealth
                                    operator: NotIn
                                    values:
                                      - ERR
                                      - TESTING
                                      - EVICT
                      containers:
                        - command:
                            - sh
                            - -c
                            - |
                              echo "Environment variables set by the kubeflow training operator:"
                              echo ${MASTER_ADDR}:${MASTER_PORT}
                              echo "PYTHONUNBUFFERED:"${PYTHONUNBUFFERED}
                              echo My global rank is ${RANK} / ${WORLD_SIZE}
                              echo "Other injected environment variables:"
                              echo "NVME_MOUNT_PATH: "${NVME_MOUNT_PATH}
                              #
                              # User commands
                              #
                              git clone https://github.com/dbarnett/python-helloworld
                              cd python-helloworld
                              echo executing: torchrun --nnodes=${WORLD_SIZE} --node_rank=${RANK} --nproc_per_node=8 --rdzv_id=101 --rdzv_endpoint="${MASTER_ADDR}:${MASTER_PORT}" helloworld.py
                              torchrun --nnodes=${WORLD_SIZE} --node_rank=${RANK} --nproc_per_node=8 --rdzv_id=101 --rdzv_endpoint="${MASTER_ADDR}:${MASTER_PORT}" helloworld.py
                          env:
                            - name: EXAMPLE_VAR1
                              value: "6"
                            - name: EXAMPLE_VAR2
                              value: example2string
                            - name: EXAMPLE_VAR3
                              valueFrom:
                                secretKeyRef:
                                  key: my-secret-key
                                  name: my-secret-name
                            - name: EXAMPLE_VAR4
                              valueFrom:
                                configMapKeyRef:
                                  key: my-configmap-key
                                  name: my-configmap-name
                          image: ghcr.io/foundation-model-stack/base:pytorch-latest-nightly-20230126
                          imagePullPolicy: IfNotPresent
                          name: pytorch
                          resources:
                            limits:
                              cpu: 500m
                              memory: 1Gi
                              nvidia.com/gpu: 8
                              nvidia.com/roce_gdr: 0
                            requests:
                              cpu: 500m
                              memory: 1Gi
                              nvidia.com/gpu: 8
                              nvidia.com/roce_gdr: 0
                          volumeMounts:
                            - mountPath: /dev/shm
                              name: dshm
                      imagePullSecrets: []
                      priorityClassName: default-priority
                      volumes:
                        - emptyDir:
                            medium: Memory
                          name: dshm
19 changes: 19 additions & 0 deletions tools/pytorchjob-generator/chart/tests/helloworld_test.yaml
@@ -101,6 +101,25 @@ tests:
      - matchSnapshot:
          path: spec.components[0].template

  - it: user-defined environment variables
    set:
      environmentVariables:
        - name: EXAMPLE_VAR1
          value: 6
        - name: EXAMPLE_VAR2
          value: "example2string"
        - name: EXAMPLE_VAR3
          secret:
            name: my-secret-name
            key: my-secret-key
        - name: EXAMPLE_VAR4
          configmap:
            name: my-configmap-name
            key: my-configmap-key
    asserts:
      - matchSnapshot:
          path: spec.components[0].template

  - it: Enabling RoCE GDR
    set:
      roceGdrResName: nvidia.com/roce_gdr
6 changes: 5 additions & 1 deletion tools/pytorchjob-generator/chart/values.yaml
@@ -81,7 +81,7 @@ limitMemoryPerPod: # <optional, default=totalMemoryPerPod> Limit of total memory


# -- (array) List of variables/values to be defined for all the ranks. Values can be literals or
# references to Kubernetes secrets. See [values.yaml](values.yaml) for examples of supported syntaxes.
# references to Kubernetes secrets or configmaps. See [values.yaml](values.yaml) for examples of supported syntaxes.
#
# NOTE: The following standard [PyTorch Distributed environment variables](https://pytorch.org/docs/stable/distributed.html#environment-variable-initialization)
# are set automatically and can be referenced in the commands without being set manually: WORLD_SIZE, RANK, MASTER_ADDR, MASTER_PORT.
@@ -95,6 +95,10 @@ environmentVariables:
#    secret:
#      name: secret-name
#      key: secret-key
#  - name: EXAMPLE_VAR4
#    configmap:
#      name: configmap-name
#      key: configmap-key

# Private GitHub clone support.
#
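
Reviewer note: with the commented examples in `values.yaml` uncommented, a user-supplied override exercising the three supported forms might look like the sketch below. The file name and all variable, Secret, and ConfigMap names are placeholders (mostly borrowed from the test case in this commit), and the other values the chart needs to render a Job are omitted.

```yaml
# settings.yaml (hypothetical file name): partial overrides for the chart.
# Only environmentVariables is shown; other chart values are omitted.
environmentVariables:
  # Literal value, rendered as a quoted string in the container env
  - name: EXAMPLE_VAR2
    value: "example2string"
  # Value read from one key of a Kubernetes Secret
  - name: EXAMPLE_VAR3
    secret:
      name: my-secret-name
      key: my-secret-key
  # Value read from one key of a Kubernetes ConfigMap (the form added by this commit)
  - name: EXAMPLE_VAR4
    configmap:
      name: my-configmap-name
      key: my-configmap-key
```

Per the template change above, both `name` and `key` are required under `configmap`; omitting either should fail rendering with the corresponding "Missing ..." message.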