A4 support for prod #412
base: develop
This PR is blocked by #416.
- name: "cpu"
  nominalQuota: 10000
- name: "memory"
  nominalQuota: 10000Gi
Not sure if we should remove these resources. We need them to schedule workloads requesting CPUs and memory. I think maybe removing `Gi` from `nominalQuota: 10000Gi` will solve the error?
nominalQuota: 10000
spec:
  namespaceSelector: {} # match all.
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
Same here, we need `cpu` and `memory` resources.
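To illustrate the concern in the two comments above, here is a minimal sketch of a check that reports which requested resources a queue spec does not cover. The helper `uncovered_resources` and the dict layout are assumptions made for illustration (they mirror the YAML fields shown in the diff); this is not part of xpk or Kueue.

```python
def uncovered_resources(queue_spec: dict, requested: set[str]) -> set[str]:
  """Return the requested resource names that no resource group covers."""
  covered = set()
  for group in queue_spec.get("resourceGroups", []):
    covered.update(group.get("coveredResources", []))
  return requested - covered


# Spec shaped like the diff under review: only the GPU resource is covered.
gpu_only_spec = {
    "namespaceSelector": {},  # match all
    "resourceGroups": [{"coveredResources": ["nvidia.com/gpu"]}],
}

# A workload that also asks for CPU and memory, as the reviewer describes.
requested = {"cpu", "memory", "nvidia.com/gpu"}

print(sorted(uncovered_resources(gpu_only_spec, requested)))
# prints ['cpu', 'memory']
```

With `cpu` and `memory` listed in `coveredResources`, the same call returns an empty set.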
src/xpk/commands/kjob_common.py
Outdated
@@ -39,6 +39,6 @@ def add_gpu_networking_annotations_to_command(args, cmd: str) -> str:

   if gpu_type == H100_MEGA_DEVICE_TYPE:
     return add_tcpxo_annotations(args, cmd)
-  if gpu_type == H200_DEVICE_TYPE:
+  if gpu_type == H200_DEVICE_TYPE or gpu_type == B200_DEVICE_TYPE:
Maybe `if gpu_type in [H200_DEVICE_TYPE, B200_DEVICE_TYPE]`?
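A minimal sketch of the membership-test form suggested above. The string values assigned to the constants are placeholders (the real values live in xpk and may differ), and `uses_rdma` and `RDMA_DEVICE_TYPES` are hypothetical names introduced only for this example.

```python
# Placeholder values for illustration; not the real xpk constants.
H100_MEGA_DEVICE_TYPE = "h100-mega-placeholder"
H200_DEVICE_TYPE = "h200-placeholder"
B200_DEVICE_TYPE = "b200-placeholder"

# Grouping the device types in one tuple keeps the condition short and
# gives a single place to extend when another device type is added.
RDMA_DEVICE_TYPES = (H200_DEVICE_TYPE, B200_DEVICE_TYPE)


def uses_rdma(gpu_type: str) -> bool:
  """Same result as `gpu_type == H200_DEVICE_TYPE or gpu_type == B200_DEVICE_TYPE`."""
  return gpu_type in RDMA_DEVICE_TYPES
```

A tuple or list works equally well here; the point is only that `in` replaces the chained `or` comparison without changing behavior.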
src/xpk/commands/kjob_common.py
Outdated
@@ -39,6 +39,6 @@ def add_gpu_networking_annotations_to_command(args, cmd: str) -> str:

   if gpu_type == H100_MEGA_DEVICE_TYPE:
     return add_tcpxo_annotations(args, cmd)
-  if gpu_type == H200_DEVICE_TYPE:
+  if gpu_type == H200_DEVICE_TYPE or gpu_type == B200_DEVICE_TYPE:
     return add_rdma_annotations(args, cmd)
I would change this method to accept the device type, `add_rdma_annotations(args, cmd, gpu_type)`, because it's possible the subnetwork_names for A3Ultra and A4 will be the same; it also avoids confusion about why we add a3ultra subnetworks for A4.
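One way the suggested signature could look, as a rough sketch only. Everything beyond the name `add_rdma_annotations` and its `(args, cmd, gpu_type)` parameters is an assumption: the constant values, the `SUBNETWORK_PREFIX` mapping, the subnetwork naming scheme, the `args.cluster` attribute, and the annotation text appended to the command are all hypothetical and do not reflect the actual xpk implementation.

```python
from types import SimpleNamespace

# Placeholder values for illustration; not the real xpk constants.
H200_DEVICE_TYPE = "h200-placeholder"
B200_DEVICE_TYPE = "b200-placeholder"

# Hypothetical mapping from device type to a subnetwork name prefix.
SUBNETWORK_PREFIX = {
    H200_DEVICE_TYPE: "a3ultra",
    B200_DEVICE_TYPE: "a4",
}


def rdma_subnetwork_names(cluster: str, gpu_type: str, count: int = 2) -> list[str]:
  """Build hypothetical subnetwork names for the given device type."""
  prefix = SUBNETWORK_PREFIX[gpu_type]
  return [f"{cluster}-{prefix}-sub-{i}" for i in range(count)]


def add_rdma_annotations(args, cmd: str, gpu_type: str) -> str:
  """Append a made-up annotation argument listing the subnetworks."""
  names = ",".join(rdma_subnetwork_names(args.cluster, gpu_type))
  # The flag below is a stand-in, not a real CLI option.
  return f"{cmd} --hypothetical-annotation subnetworks={names}"


args = SimpleNamespace(cluster="demo")
print(add_rdma_annotations(args, "run", B200_DEVICE_TYPE))
```

Passing `gpu_type` explicitly keeps the device-specific naming in one lookup, so an A4 cluster never picks up a3ultra names by accident.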
I think this was my mistake; we should add A4 networks for A4...
I tweaked this part a bit to avoid overcomplicating things. Let me know what you think!
they are required and most likely not the cause of the error
Fixes / Features