Getting GPU device minor number: Not Supported #332
Comments
You seem to be running the device plugin under WSL2. This is not currently a supported use case of the device plugin. The specific reason is that device nodes on WSL2 and on native Linux systems are not the same, and as such the CPU Manager workaround (which includes the device nodes in the container being launched) does not work as expected. |
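For illustration, the difference looks roughly like this on a typical setup (exact device nodes vary with the driver configuration):

```
# Native Linux: per-GPU device nodes exist and can be passed into containers.
ls -l /dev/nvidiactl /dev/nvidia-uvm /dev/nvidia0

# WSL2: GPU access goes through a single paravirtualized node instead,
# so there are no per-GPU device minor numbers to include.
ls -l /dev/dxg
```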
All right. I followed this guide https://docs.nvidia.com/cuda/wsl-user-guide/index.html#getting-started-with-cuda-on-wsl to install CUDA on WSL. Looking at the known limitations, https://docs.nvidia.com/cuda/wsl-user-guide/index.html#known-limitations-for-linux-cuda-apps , nothing there prevents installing k8s on WSL, and running GPU workloads with the ctr command works fine as well. |
@elezar would you guys put this on the roadmap? Our company runs Windows but we want to transition to Linux, so WSL2 seems like a natural choice. We are running deep learning workloads that require CUDA support, and while Docker Desktop does support GPU workloads, it would be strange not to see this work in normal WSL2 containers as well. |
Hi @elezar, |
@patrykkaj I think that in theory this could be done by outside contributors and is simplified by the recent changes to support Tegra-based systems. What I can see happening here is that:
Some things to note here:
If you feel comfortable creating an MR against https://gitlab.com/nvidia/kubernetes/device-plugin that adds this functionality, we can work together on getting it in. |
Hello, I was interested in this, and I adapted the plugin to work. I can try to do a clean version, but I don't really know how to correctly check if |
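For what it's worth, two common heuristics for detecting a WSL2 environment from inside Linux are the kernel version string and the presence of /dev/dxg. A minimal sketch (one possible approach, not necessarily how the plugin should do it):

```
# Heuristic 1: the WSL2 kernel advertises itself in its version string.
if grep -qi microsoft /proc/version; then
    echo "running under WSL"
fi

# Heuristic 2: the paravirtualized GPU device node only exists on WSL2.
if [ -e /dev/dxg ]; then
    echo "WSL2 GPU device node present"
fi
```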
@Vinrobot thanks for the work here. Some thoughts on this: we recently moved away from gpu-monitoring-tools, and I think the steps outlined in #332 (comment) should be considered as the starting point (in particular, check if …). With regards to the following:
I don't think that this is required. If there are no NVIDIA GPUs available on the system then the NVML enumeration that is used to list the devices would not be expected to work. This should already be handled by the lower-level components of the NVIDIA container stack. |
Hi @elezar, I tried to make it work with the most recent version, but I got this error (on the pod)
which is caused by this line in gpu-monitoring-tools (still used by gpuallocator). As it's the same as before, I can re-use my custom version of gpu-monitoring-tools to make it work, but it's not the goal. |
@Vinrobot yes, it is an issue that gpu-monitoring-tools is still used there. The issue is the call to get aligned allocation here. (You can confirm this by removing this section.) If this does work, what we would need is a mechanism to disable this for WSL2 devices. One option would be to add a …
(Note that this should still be discussed and could definitely be improved, but would be a good starting point). |
Hi @elezar, I'm also interested in running the device plugin with WSL2. Would be great to get those changes in. |
Thanks @achim92 -- I will have a look at the MR. Note that with the v1.13.0 release of the NVIDIA Container Toolkit we now support the generation of CDI specifications on WSL2 based systems. Support for consuming this and generating a spec for available devices was included in the v0.14.0 version of the device plugin. This was largely targeted at usage in the context of our GPU operator, but could be generalised to also support WSL2-based systems without requiring additional device plugin changes. |
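For reference, generating a CDI specification on WSL2 with NVIDIA Container Toolkit v1.13.0 or later looks roughly like this (the output path is a common default, not something specific to this thread):

```
# Generate a CDI specification describing the GPUs visible to the WSL2 guest.
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

# List the device names defined by the generated specification
# (available in recent toolkit versions).
nvidia-ctk cdi list
```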
hi @elezar,
|
Thanks @elezar, it would be even better without requiring additional device plugin changes. I have generated the CDI spec with …
I also removed the NVIDIA Container Runtime hook under …
|
@elezar could you please give some guidance here? |
Hi brother, I've encountered the same issue. Have you managed to solve it? |
Note: We have https://gitlab.com/nvidia/kubernetes/device-plugin/-/merge_requests/291 under review from @achim92 to allow the device plugin to work under WSL2. Testing of the changes there would be welcomed. |
Hi @elezar, how can I test your changes? Do I need to build a new image and install the plugin into my k8s cluster using https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml as a template? Thanks |
@elezar We are also interested in this |
I believe |
✔️ Verified registry.gitlab.com/nvidia/kubernetes/device-plugin/staging/k8s-device-plugin:8b416016

- WSL environment
- K8S setup
- nvidia-smi output in WSL
- nvidia-device-plugin daemonset pod log
- Test GPU pod output (used the example from https://docs.k3s.io/advanced#nvidia-container-runtime-support)

Thank you @elezar. I hope this commit can be merged into this repo and published asap 🚀 ! |
@davidshen84 I can also confirm it works. However, we have to add some additional stuff:

```
$ touch /run/nvidia/validations/toolkit-ready
$ touch /run/nvidia/validations/driver-ready
$ mkdir -p /run/nvidia/driver/dev
$ ln -s /run/nvidia/driver/dev/dxg /dev/dxg
```

Annotate the WSL node:

```
nvidia.com/gpu-driver-upgrade-state: pod-restart-required
nvidia.com/gpu.count: '1'
nvidia.com/gpu.deploy.container-toolkit: 'true'
nvidia.com/gpu.deploy.dcgm: 'true'
nvidia.com/gpu.deploy.dcgm-exporter: 'true'
nvidia.com/gpu.deploy.device-plugin: 'true'
nvidia.com/gpu.deploy.driver: 'true'
nvidia.com/gpu.deploy.gpu-feature-discovery: 'true'
nvidia.com/gpu.deploy.node-status-exporter: 'true'
nvidia.com/gpu.deploy.nvsm: ''
nvidia.com/gpu.deploy.operands: 'true'
nvidia.com/gpu.deploy.operator-validator: 'true'
nvidia.com/gpu.present: 'true'
nvidia.com/device-plugin.config: 'RTX-4070-Ti'
```

Change device plugin in ClusterPolicy:

```
devicePlugin:
  config:
    name: time-slicing-config
  enabled: true
  env:
    - name: PASS_DEVICE_SPECS
      value: 'true'
    - name: FAIL_ON_INIT_ERROR
      value: 'true'
    - name: DEVICE_LIST_STRATEGY
      value: envvar
    - name: DEVICE_ID_STRATEGY
      value: uuid
    - name: NVIDIA_VISIBLE_DEVICES
      value: all
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: all
  image: k8s-device-plugin
  imagePullPolicy: IfNotPresent
  repository: registry.gitlab.com/nvidia/kubernetes/device-plugin/staging
  version: 8b416016
```

It should work for now.
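If you are applying these by hand, a rough sketch with kubectl could look like the following (the node name is a placeholder, and whether a given key is consumed as a label or an annotation depends on the operator):

```
# Placeholder node name; replace with your WSL2 node.
NODE=wsl2-node

# Apply a few of the toggles listed above as node labels.
kubectl label node "$NODE" \
  nvidia.com/gpu.present=true \
  nvidia.com/gpu.deploy.device-plugin=true \
  nvidia.com/gpu.deploy.container-toolkit=true \
  nvidia.com/device-plugin.config=RTX-4070-Ti \
  --overwrite
```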
|
I created the RuntimeClass resource and added the runtimeClassName property to the pods:
```
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
```
I did not add those properties you mentioned. Why do I need them?
Thanks
|
@davidshen84 Because I used the gpu-operator for automatic GPU provisioning. |
Thanks for the tip!
|
I verified the staging image.

Based on dockerd

Step 1, install a k3s cluster based on dockerd:

```
curl -sfL https://get.k3s.io | sh -s - --docker
```

Step 2, install the device plugin with the staging image:

```
# set RuntimeClass
cat <<EOF | kubectl apply -f -
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: docker
EOF

# install nvdp
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --namespace nvdp \
  --create-namespace \
  --set=runtimeClassName=nvidia \
  --set=image.repository=registry.gitlab.com/nvidia/kubernetes/device-plugin/staging/k8s-device-plugin \
  --set=image.tag=8b416016
```

Based on containerd

Step 1, install a k3s cluster based on containerd:

```
curl -sfL https://get.k3s.io | sh -
```

Step 2, install the device plugin with the staging image:

```
# set RuntimeClass
cat <<EOF | kubectl apply -f -
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia # change the handler to `nvidia` for containerd
EOF

# install nvdp with the same steps as above.
```

Test with nvdp

```
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  runtimeClassName: nvidia
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
EOF
```

And, the example … |
Hi @elezar, I see this PR was merged in the upstream repository a long time ago. What's the plan for publishing this on GitHub? |
Hi @elezar, I can confirm
|
Tested and documented in qbo with:
https://docs.qbo.io/#/ai_and_ml?id=kubeflow

Thanks to @achim92's contribution and @elezar's approval :)

This fix also works for kind kubernetes; more details here: kubernetes-sigs/kind#3257 (comment)

A couple of notes for the gpu-operator

Labels

The Nvidia GPU operator requires a manual label. The label can be added as follows: …
The reason is that WSL2 doesn't contain PCI info under …; I believe the relevant code is here: node-feature-discovery/source/usb/utils.go:106
I believe the right place to add this label is once the driver has been detected on the host. See https://gitlab.com/nvidia/kubernetes/gpu-operator/-/merge_requests/881/diffs; I'll add my comments there.

Docker Image for device-plugin

I built a new image based on https://gitlab.com/nvidia/kubernetes/device-plugin/-/merge_requests/291 for testing purposes, but it also works with the one provided here: #332 (comment)

Docker Image for gpu-operator

I created a docker image with changes similar to this: https://gitlab.com/nvidia/kubernetes/gpu-operator/-/merge_requests/881/diffs

Docker Image for gpu-operator-validator

Blogs on how to install: Nvidia GPU Operator + Kubeflow + Docker in Docker + cgroups v2 (In Linux and Windows WSL2) |
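The exact label is not spelled out above; purely as an illustration, assuming it is the node-feature-discovery PCI vendor label that a WSL2 node cannot derive on its own (NVIDIA's PCI vendor ID being 10de), adding it manually would look something like:

```
# Assumed label key: NFD's PCI vendor label, which WSL2 cannot auto-discover.
kubectl label node <wsl2-node> feature.node.kubernetes.io/pci-10de.present=true --overwrite
```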
Thank you for working on this, now that WSL2 supports systemd I think more people will be running k8s on Windows. |
Just a general note: We will release a v0.15.0-rc.1 that includes these changes. |
hi @elezar any update on when the v0.15.0-rc.1 is going to be out? |
v0.15.0-rc1 successfully enabled my scenario today: https://github.com/mrjohnsonalexander/classic

TL;DR stack notes
|
This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed. |
1. Issue or feature description
helm install nvidia-device-plugin
nvidia-device-plugin-ctr logs
When I use ctr to run a GPU test, it works fine.
3. Information to attach (optional if deemed irrelevant)
Common error checking:
- The output of `nvidia-smi -a` on your host
- Your docker configuration file (e.g. `/etc/docker/daemon.json`)
- The kubelet logs on the node (e.g. `sudo journalctl -r -u kubelet`)

Additional information that might help better understand your environment and reproduce the bug:
- Any relevant kernel output lines from `dmesg`
- NVIDIA packages version from `dpkg -l '*nvidia*'` or `rpm -qa '*nvidia*'`
- NVIDIA container library version from `nvidia-container-cli -V`
- containerd config `containerd.toml`
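For reference, the information listed above can be gathered on the WSL2 node roughly as follows (paths are common defaults and vary between installs, e.g. k3s keeps its containerd config elsewhere):

```
# Driver and GPU state as seen from the WSL2 guest.
nvidia-smi -a

# Docker configuration and kubelet logs.
cat /etc/docker/daemon.json
sudo journalctl -r -u kubelet

# Kernel messages plus NVIDIA package and container library versions.
sudo dmesg | tail -n 100
dpkg -l '*nvidia*'        # or: rpm -qa '*nvidia*'
nvidia-container-cli -V

# containerd configuration (default path shown).
cat /etc/containerd/config.toml
```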