
A4 support for prod #412 (Open)

gcie wants to merge 36 commits into develop

Conversation

@gcie (Collaborator) commented Mar 7, 2025

Fixes / Features

  • A4 support for prod

gcie added the release-features label Mar 7, 2025
gcie self-assigned this Mar 7, 2025
gcie marked this pull request as ready for review March 10, 2025 12:26
gcie marked this pull request as draft March 11, 2025 10:57
gcie marked this pull request as ready for review March 11, 2025 12:05
gcie marked this pull request as draft March 11, 2025 12:05
@gcie (Collaborator, Author) commented Mar 12, 2025

This PR is blocked by #416

gcie marked this pull request as ready for review March 26, 2025 12:02
- name: "cpu"
  nominalQuota: 10000
- name: "memory"
  nominalQuota: 10000Gi
Collaborator:

Not sure if we should remove these resources. We need them to schedule workloads requesting CPU and memory. I think maybe removing Gi from nominalQuota: 10000Gi will solve the error?

nominalQuota: 10000

spec:
  namespaceSelector: {} # match all.
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
Collaborator:

Same here, we need the CPU and memory resources.

@@ -39,6 +39,6 @@ def add_gpu_networking_annotations_to_command(args, cmd: str) -> str:

     if gpu_type == H100_MEGA_DEVICE_TYPE:
         return add_tcpxo_annotations(args, cmd)
-    if gpu_type == H200_DEVICE_TYPE:
+    if gpu_type == H200_DEVICE_TYPE or gpu_type == B200_DEVICE_TYPE:
Collaborator:

Maybe if gpu_type in [H200_DEVICE_TYPE, B200_DEVICE_TYPE]?
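A minimal sketch of the suggested membership check; the device-type constant values below are placeholders, not the repository's actual strings:

```python
# Sketch of the reviewer's suggestion: a single membership test instead of
# chained equality checks. Constant values are placeholders only.
H200_DEVICE_TYPE = "h200-141gb-8"   # placeholder value
B200_DEVICE_TYPE = "b200-180gb-8"   # placeholder value

RDMA_DEVICE_TYPES = (H200_DEVICE_TYPE, B200_DEVICE_TYPE)


def needs_rdma_annotations(gpu_type: str) -> bool:
    """Return True for device types that should get RDMA networking annotations."""
    return gpu_type in RDMA_DEVICE_TYPES


assert needs_rdma_annotations("b200-180gb-8")
assert not needs_rdma_annotations("h100-mega-80gb-8")
```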

@@ -39,6 +39,6 @@ def add_gpu_networking_annotations_to_command(args, cmd: str) -> str:

     if gpu_type == H100_MEGA_DEVICE_TYPE:
         return add_tcpxo_annotations(args, cmd)
-    if gpu_type == H200_DEVICE_TYPE:
+    if gpu_type == H200_DEVICE_TYPE or gpu_type == B200_DEVICE_TYPE:
         return add_rdma_annotations(args, cmd)
Collaborator:

I would change this method to accept the device type: add_rdma_annotations(args, cmd, gpu_type), because it's possible the subnetwork_names for A3 Ultra and A4 will be the same; it also mitigates the confusion about why we add A3 Ultra subnetworks for A4.
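A rough sketch of what passing the device type through could look like. Everything here is illustrative: the constant values, the args.device_type attribute, the networks-prefix flag, and the helper body are assumptions, not the project's actual API.

```python
from types import SimpleNamespace

H200_DEVICE_TYPE = "h200-141gb-8"   # placeholder
B200_DEVICE_TYPE = "b200-180gb-8"   # placeholder


def add_rdma_annotations(args, cmd: str, gpu_type: str) -> str:
    """Sketch: choose per-device settings inside the helper instead of the caller."""
    prefix = "a3ultra" if gpu_type == H200_DEVICE_TYPE else "a4"
    # A real implementation would render the per-NIC subnetwork annotations
    # here; this stand-in only tags the command with the chosen prefix.
    return f"{cmd} --networks-prefix={prefix}"


def add_gpu_networking_annotations_to_command(args, cmd: str) -> str:
    gpu_type = args.device_type  # assumed attribute name, for illustration only
    if gpu_type in (H200_DEVICE_TYPE, B200_DEVICE_TYPE):
        return add_rdma_annotations(args, cmd, gpu_type)
    return cmd


print(add_gpu_networking_annotations_to_command(
    SimpleNamespace(device_type=B200_DEVICE_TYPE), "run-job"))
```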

Collaborator (Author):

I think this was a mistake on my part; we should add A4 networks for A4...

I tweaked this part a bit to avoid overcomplicating things. Let me know what you think!
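For illustration only, one way "A4 networks for A4" could be laid out as a per-device mapping; the keys and name patterns below are hypothetical, not the PR's actual subnetwork names:

```python
# Hypothetical mapping from device type to its subnetwork name pattern.
SUBNETWORK_PATTERNS = {
    "h200-141gb-8": "a3ultra-sub-{index}",  # placeholder pattern
    "b200-180gb-8": "a4-sub-{index}",       # placeholder pattern
}


def subnetwork_names(gpu_type: str, nic_count: int = 8) -> list[str]:
    """Build one subnetwork name per NIC for the given device type."""
    pattern = SUBNETWORK_PATTERNS[gpu_type]
    return [pattern.format(index=i) for i in range(nic_count)]


print(subnetwork_names("b200-180gb-8"))
```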

gcie added 3 commits April 2, 2025 10:33
they are required and most likely not the cause of the error