nvidia.service arm64 support & fixes #2694
This ebuild probably should get a bump to -r1.
That makes sense in general, but would you be fine making an exception in this case, since the kernel version is bumped automatically every couple of days?
Sure.
3a64222 to 5e8b137
Use `uname -m` to fetch the correct driver installer for aarch64 or x86_64. Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
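The architecture selection described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the helper name `installer_name` is hypothetical, and the `NVIDIA-Linux-<arch>-<version>.run` file name pattern is an assumption about how the installers are named.

```shell
# Hypothetical helper, not taken from the PR: map the output of `uname -m`
# to an installer file name. The NVIDIA-Linux-<arch>-<version>.run pattern
# is an assumption about how the installers are named.
installer_name() {
  # $1: machine name as printed by `uname -m`; $2: driver version
  case "$1" in
    x86_64|aarch64)
      printf 'NVIDIA-Linux-%s-%s.run\n' "$1" "$2"
      ;;
    *)
      # Fail loudly instead of guessing an installer for an unknown machine.
      printf 'unsupported architecture: %s\n' "$1" >&2
      return 1
      ;;
  esac
}
```

A caller would typically pass the live machine name, for example `installer_name "$(uname -m)" "${version}"`, where `version` is whatever driver version the service has been configured with.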
Users have reported that in some cases the nvidia.service fails because /opt/nvidia/current is a directory and the symbolic link gets created inside it. I have no idea how we get there, but to make the service robust in the face of this kind of issue:
- remove the directory if it exists
- use `-T` with ln to ensure that symbolic link creation fails if `current` is a directory
Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
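The two steps listed in this commit could look roughly like the sketch below. The function name `link_current` and its arguments are hypothetical, the check that only removes a real directory (and leaves an existing symlink alone) is an assumption rather than something the commit message states, and `-T` requires GNU coreutils `ln`.

```shell
# Hypothetical helper, not taken from the PR.
link_current() {
  # $1: link target (for example a versioned driver directory)
  # $2: path of the symlink (for example /opt/nvidia/current)

  # If the link path exists as a real directory (not a symlink), remove it,
  # otherwise ln would either create the link inside it or refuse to run.
  if [ -d "$2" ] && [ ! -L "$2" ]; then
    rm -rf -- "$2"
  fi

  # -s: symbolic link, -f: replace an existing link,
  # -T: always treat $2 as the link name itself, never as a directory
  #     to create the link in (GNU coreutils).
  ln -sfT -- "$1" "$2"
}
```

With `-T`, a leftover directory at the link path makes `ln` fail instead of silently producing a link nested inside that directory, so a regression would show up as a unit failure rather than a misplaced link.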
This saves space at runtime. Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
Installers for 570 sometimes default to Open drivers, which we can't support properly at this time. Force proprietary drivers. There are also additional options that suppress certain worrisome error strings - enable those if supported too. Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
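One way to enable options only "if supported" is to filter a candidate list against the installer's help text. The sketch below is an illustration under stated assumptions: the helper `filter_supported_options` is hypothetical, and the option spelling used in the usage note is an assumption, since the commit message does not name the options it passes.

```shell
# Hypothetical helper, not taken from the PR: print only those candidate
# options whose name appears in the given help text.
filter_supported_options() {
  # $1: help text of the installer; remaining arguments: candidate options
  help_text="$1"
  shift
  for opt in "$@"; do
    # Compare on the option name only, ignoring any "=value" suffix.
    name="${opt%%=*}"
    # Plain substring match: a name that is a prefix of a longer option
    # name would also match, which is acceptable for a sketch.
    case "${help_text}" in
      *"${name}"*) printf '%s\n' "${opt}" ;;
    esac
  done
}
```

Assuming the installer prints its full option list with `--advanced-options` and spells the module type switch `--kernel-module-type=proprietary`, the call could be `filter_supported_options "$(./installer.run --advanced-options)" --kernel-module-type=proprietary`, where `installer.run` stands in for the downloaded installer.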
The nspawn container runs in its own scope, and journal output is then associated with that scope. By passing `--keep-unit` we can guarantee that all log output stays associated with nvidia.service and can be viewed by running `journalctl -u nvidia.service`. Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
So that we can pick up kmods contained in sysexts (like zfs) and generate complete module dependency information. I thought we could skip running depmod for nvidia drivers because we manually insmod them, but NVIDIA's GPU operator driver validation expects to be able to run modprobe, so we have to generate that information. Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
5e8b137 to 21665e5
Turns out we need the depmod information for NVIDIA GPU operator, so I removed this commit.
Build action triggered: https://github.com/flatcar/scripts/actions/runs/13683502957
I'm still iterating on testing this PR with this new kola test: flatcar/mantle#583
Azure test on NC8as_T4_v3: green http://jenkins.infra.kinvolk.io:8080/job/container/job/test/32700/console Testing used this mantle build:
...r/src/third_party/coreos-overlay/x11-drivers/nvidia-drivers/nvidia-drivers-535.230.02.ebuild
...er/src/third_party/coreos-overlay/x11-drivers/nvidia-drivers/nvidia-drivers-570.86.15.ebuild
The R535 driver branch, which is LTS, does not compile on arm64 with GCC 14/kernel 6.6. Keep amd64 on R535 and switch arm64 to R570 by default. R570 is the first driver version that I found that is currently supported and works for arm64. Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
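The per-architecture default described in this commit can be pictured with the sketch below. It is an illustration only: the function name `default_driver_version` is hypothetical, and the concrete version numbers are assumed from the ebuild file names touched in this PR (535.230.02 for R535, 570.86.15 for R570), not from the service code itself.

```shell
# Hypothetical helper, not taken from the PR: choose a default driver
# version from the machine name printed by `uname -m`.
default_driver_version() {
  case "$1" in
    # amd64 stays on the R535 (LTS) branch.
    x86_64) echo "535.230.02" ;;
    # arm64 defaults to R570, since R535 does not build there.
    aarch64) echo "570.86.15" ;;
    *)
      echo "no default driver version for: $1" >&2
      return 1
      ;;
  esac
}
```

Keeping the mapping in one place means a later branch switch for either architecture is a one-line change, for example `version="$(default_driver_version "$(uname -m)")"` at the call site.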
76585b7 to 3e2f797
nvidia.service arm64 support & fixes
Add support for arm64 to nvidia.service and fix other related issues. Here is a brief overview of all changes:
- `current` symlink may end up created as a directory, which breaks the unit silently. No idea how this can happen but handle this case.
- Remove depmod generation. We didn't depend on this being done and this was broken because those parts of the filesystem are readonly in the nspawn container. See also: nvidia-driver sysext hides zfs modules Flatcar#1576
How to use
[ describe what reviewers need to do in order to validate this PR ]
Testing done
Tested on g5g.xlarge instances in AWS during development, but need to script this or repeat with the final PR result.
Jenkins build (covers Azure GPU instances) running here: http://jenkins.infra.kinvolk.io:8080/job/container/job/packages_all_arches/5457/cldsv/
[Describe the testing you have done before submitting this PR. Please include both the commands you issued as well as the output you got.]
- `changelog/` directory (user-facing change, bug fix, security fix, update)
- `/boot` and `/usr` size, packages, list files for any missing binaries, kernel modules, config files, etc.