[k8s] Fix GKELabelFormatter for H100s #3627

Merged
merged 2 commits into master from gkeh100fix
Jun 4, 2024

romilbhardwaj
Collaborator

get_gke_accelerator_name returns the incorrect label for H100 (nvidia-tesla-h100 instead of nvidia-h100). This PR fixes it by removing the incorrect conditional check for H100.

Tested:

  • sky launch --gpus H100:1 on a GKE cluster with H100s.

@romilbhardwaj
Collaborator Author

H100s are hard to get on GKE, so to test this PR I mocked one on my sky local up cluster by patching the node's GPU capacity and adding the GKE accelerator label:

kubectl proxy

curl --header "Content-Type: application/json-patch+json" \
  --request PATCH \
  --data '[{"op": "add", "path": "/status/capacity/nvidia.com~1gpu", "value": "8"}]' \
  http://localhost:8001/api/v1/nodes/skypilot-control-plane/status


kubectl label nodes skypilot-control-plane cloud.google.com/gke-accelerator=nvidia-h100-80gb

With that, sky launch --gpus H100:1 works as expected, and the GPU name is now H100 (instead of H100-80GB).
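
For anyone reproducing the mock, note that the `~1` in the patch path is the JSON Pointer (RFC 6901) escape for `/`, which is needed because the resource name `nvidia.com/gpu` itself contains a slash. Below is a minimal sketch of building the same patch body in Python. The helper names `escape_pointer_token` and `gpu_capacity_patch` are hypothetical and are not part of SkyPilot or kubectl.

```python
import json


def escape_pointer_token(token: str) -> str:
    """Escape one JSON Pointer reference token (RFC 6901)."""
    # '~' must be escaped first; otherwise the '~' produced by
    # the '/' -> '~1' replacement would be escaped a second time.
    return token.replace('~', '~0').replace('/', '~1')


def gpu_capacity_patch(resource: str, count: int) -> str:
    """Build a JSON Patch body that adds `resource` under /status/capacity."""
    path = '/status/capacity/' + escape_pointer_token(resource)
    # The count is sent as a string, matching the curl command above.
    return json.dumps([{'op': 'add', 'path': path, 'value': str(count)}])


if __name__ == '__main__':
    # Prints the body passed to curl via --data in the comment above.
    print(gpu_capacity_patch('nvidia.com/gpu', 8))
```

The printed string can be passed to curl's --data flag in place of the hand-written JSON, which avoids escaping mistakes when mocking other extended resources.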

@romilbhardwaj romilbhardwaj merged commit 0ebc5fd into master Jun 4, 2024
20 checks passed
@romilbhardwaj romilbhardwaj deleted the gkeh100fix branch June 4, 2024 04:40
Michaelvll pushed a commit that referenced this pull request Aug 23, 2024
* H100-80gb does not exist, fix to H100

* Fix H100 support