set PASS_DEVICE_SPECS ENV to device-plugin #690
…PECS=true as an environment variable. Signed-off-by: 张 驰 <919474320@qq.com>
Could you discuss this PR at the next weekly meeting?
OK, I'm in the WeChat group, so you can remind me during the weekly meeting.
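Thanks for submitting this PR. I spent some time going through the relevant context; below is my understanding of the problem and the solution, along with suggestions for improvement: Problem background: In containerized GPU workloads, especially when using
Root cause: Having looked at the earlier related issues and NVIDIA's official issue (NVIDIA/gpu-operator#485), the core cause appears to be:
Solution: why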
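Thanks for the reply. Since setting this variable works, could we consider making its default value true?
@jingzhe6414 Indeed, I also think it could be enabled by default. The main reason is that the latest helm chart already runs privileged, so enabling it by default should carry no additional risk. Also, although NVIDIA's … So as far as I can tell at the moment …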
…_SPECS ENV to passDeviceSpecsEnabled, and set the default value to true. Signed-off-by: 张 驰 <919474320@qq.com>
set PASS_DEVICE_SPECS ENV to device-plugin
What type of PR is this?
/kind bug
What this PR does / why we need it:
Mitigate the issue of GPU unavailability during the pod runtime.
Which issue(s) this PR fixes:
Fixes #658
Containerized GPU workloads may suddenly lose access to their GPUs. This situation occurs when systemd is used to manage the cgroups of the container and it is triggered to reload any Unit files that have references to NVIDIA GPUs (e.g. with something as simple as a systemctl daemon-reload).
Special notes for your reviewer:
https://github.com/NVIDIA/gpu-operator/issues/485
Following the solution provided in NVIDIA's k8s-device-plugin, I added the PASS_DEVICE_SPECS ENV to the plugin. In testing, this mitigated the issue even without setting privileged.
Additionally, I could not find any usage of PASS_DEVICE_SPECS in hami-device-plugin, although it is used in NVIDIA's k8s-device-plugin. I am puzzled as to why setting it works here and have not found a reasonable explanation yet. If you have any insights, please let me know.
Does this PR introduce a user-facing change?: yes