
A4 support for prod #412 (Open)

gcie wants to merge 36 commits into develop

Conversation

@gcie (Collaborator) commented Mar 7, 2025

Fixes / Features

  • A4 support for prod

gcie added the release-features label Mar 7, 2025
gcie self-assigned this Mar 7, 2025
gcie marked this pull request as ready for review March 10, 2025 12:26
gcie marked this pull request as draft March 11, 2025 10:57
gcie marked this pull request as ready for review March 11, 2025 12:05
gcie marked this pull request as draft March 11, 2025 12:05
@gcie (Collaborator, Author) commented Mar 12, 2025

This PR is blocked by #416

gcie marked this pull request as ready for review March 26, 2025 12:02
- name: "cpu"
  nominalQuota: 10000
- name: "memory"
  nominalQuota: 10000Gi
Collaborator:

Not sure if we should remove these resources. We need them to schedule workloads requesting CPU and memory. I think maybe removing Gi from nominalQuota: 10000Gi will solve the error?

nominalQuota: 10000

spec:
  namespaceSelector: {} # match all.
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
Collaborator:

Same here, we need the CPU and memory resources.

@@ -39,6 +39,6 @@ def add_gpu_networking_annotations_to_command(args, cmd: str) -> str:

     if gpu_type == H100_MEGA_DEVICE_TYPE:
         return add_tcpxo_annotations(args, cmd)
-    if gpu_type == H200_DEVICE_TYPE:
+    if gpu_type == H200_DEVICE_TYPE or gpu_type == B200_DEVICE_TYPE:
Collaborator:

Maybe if gpu_type in [H200_DEVICE_TYPE, B200_DEVICE_TYPE]?
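A minimal sketch of the suggested membership check; the device-type constant values below are placeholders, not the repository's actual strings:

```python
# Sketch of the reviewer's suggestion: a single membership test instead of
# chained equality checks. Constant values are placeholders only.
H200_DEVICE_TYPE = "h200-141gb-8"   # placeholder value
B200_DEVICE_TYPE = "b200-180gb-8"   # placeholder value

RDMA_DEVICE_TYPES = (H200_DEVICE_TYPE, B200_DEVICE_TYPE)


def needs_rdma_annotations(gpu_type: str) -> bool:
    """Return True for device types that should get RDMA networking annotations."""
    return gpu_type in RDMA_DEVICE_TYPES


assert needs_rdma_annotations("b200-180gb-8")
assert not needs_rdma_annotations("h100-mega-80gb-8")
```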

@@ -39,6 +39,6 @@ def add_gpu_networking_annotations_to_command(args, cmd: str) -> str:

     if gpu_type == H100_MEGA_DEVICE_TYPE:
         return add_tcpxo_annotations(args, cmd)
-    if gpu_type == H200_DEVICE_TYPE:
+    if gpu_type == H200_DEVICE_TYPE or gpu_type == B200_DEVICE_TYPE:
         return add_rdma_annotations(args, cmd)
Collaborator:

I would change this method to accept the device type: add_rdma_annotations(args, cmd, gpu_type), because it's possible the subnetwork_names for A3 Ultra and A4 will be the same; it also mitigates the confusion about why we add A3 Ultra subnetworks for A4.
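A rough sketch of what passing the device type through could look like. Everything here is illustrative: the constant values, the args.device_type attribute, the networks-prefix flag, and the helper body are assumptions, not the project's actual API.

```python
from types import SimpleNamespace

H200_DEVICE_TYPE = "h200-141gb-8"   # placeholder
B200_DEVICE_TYPE = "b200-180gb-8"   # placeholder


def add_rdma_annotations(args, cmd: str, gpu_type: str) -> str:
    """Sketch: choose per-device settings inside the helper instead of the caller."""
    prefix = "a3ultra" if gpu_type == H200_DEVICE_TYPE else "a4"
    # A real implementation would render the per-NIC subnetwork annotations
    # here; this stand-in only tags the command with the chosen prefix.
    return f"{cmd} --networks-prefix={prefix}"


def add_gpu_networking_annotations_to_command(args, cmd: str) -> str:
    gpu_type = args.device_type  # assumed attribute name, for illustration only
    if gpu_type in (H200_DEVICE_TYPE, B200_DEVICE_TYPE):
        return add_rdma_annotations(args, cmd, gpu_type)
    return cmd


print(add_gpu_networking_annotations_to_command(
    SimpleNamespace(device_type=B200_DEVICE_TYPE), "run-job"))
```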

Collaborator (Author):

I think this was a mistake on my part; we should add A4 networks for A4...

I tweaked this part a bit to avoid overcomplicating things. Let me know what you think!
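For illustration only, one way "A4 networks for A4" could be laid out as a per-device mapping; the keys and name patterns below are hypothetical, not the PR's actual subnetwork names:

```python
# Hypothetical mapping from device type to its subnetwork name pattern.
SUBNETWORK_PATTERNS = {
    "h200-141gb-8": "a3ultra-sub-{index}",  # placeholder pattern
    "b200-180gb-8": "a4-sub-{index}",       # placeholder pattern
}


def subnetwork_names(gpu_type: str, nic_count: int = 8) -> list[str]:
    """Build one subnetwork name per NIC for the given device type."""
    pattern = SUBNETWORK_PATTERNS[gpu_type]
    return [pattern.format(index=i) for i in range(nic_count)]


print(subnetwork_names("b200-180gb-8"))
```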

gcie added 3 commits April 2, 2025 10:33
they are required and most likely not the cause of the error