nvidia.service arm64 support & fixes #2694

Open
wants to merge 8 commits into
base: main
jepio
Member

@jepio jepio commented Feb 25, 2025

nvidia.service arm64 support & fixes

Add support for arm64 to nvidia.service and fix other related issues. Here is a brief overview of all changes:

  • coreos-modules shipped kmod build tools in /lib/modules that were not properly cross-compiled for arm64. They are now.
  • Fetch an architecture-specific installer file from nvidia. See also: nvidia GPU drivers fail to install on arm64 (g5g.xlarge on AWS) Flatcar#1649
  • @njuettner reported that in some cases the `current` symlink may end up created as a directory, which breaks the unit silently. No idea how this can happen, but handle this case.
  • Newer nvidia driver versions like 570.x default to kernel-open modules (maybe only on arm64?), which the service is not ready to handle (and won't be able to quickly due to a dependency on firmware files). Explicitly force non-kernel-open drivers.
  • Remove depmod generation. We didn't depend on this being done and this was broken because those parts of the filesystem are readonly in the nspawn container. See also: nvidia-driver sysext hides zfs modules Flatcar#1576
  • Make the devcontainer image sparse.
  • Add some more verbose output from the unit.

How to use

[ describe what reviewers need to do in order to validate this PR ]

Testing done

Tested on g5g.xlarge instances in AWS during development, but need to script this or repeat with the final PR result.
Jenkins build (covers Azure GPU instances) running here: http://jenkins.infra.kinvolk.io:8080/job/container/job/packages_all_arches/5457/cldsv/

[Describe the testing you have done before submitting this PR. Please include both the commands you issued as well as the output you got.]

  • Changelog entries added in the respective changelog/ directory (user-facing change, bug fix, security fix, update)
  • Inspected CI output for image differences: /boot and /usr size, packages, list files for any missing binaries, kernel modules, config files, kernel modules, etc.

@jepio jepio requested review from danzatt and a team February 25, 2025 16:05
@jepio jepio changed the title from "nvidia.service arm64 support & fix" to "nvidia.service arm64 support & fixes" Feb 25, 2025
Member

This ebuild probably should get a bump to -r1.

Member Author

That makes sense in general, but would you be fine making an exception in this case, since the kernel version is bumped every couple of days automatically?

Member

Sure.

@jepio jepio force-pushed the setup-nvidia-fixes branch from 3a64222 to 5e8b137 Compare March 5, 2025 15:16
jepio added 7 commits March 5, 2025 16:17
Use `uname -m` to fetch the correct driver installer for aarch64 or x86_64.

Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
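
A minimal sketch of the architecture selection described in the commit above. The helper name `installer_name` and the version string passed to it are hypothetical and not taken from the PR; only the use of `uname -m` and the two architectures come from the commit message. The file name pattern follows NVIDIA's usual `NVIDIA-Linux-<arch>-<version>.run` naming.

```shell
#!/bin/sh
# Hypothetical helper: print the installer file name for a driver version.
# $1 = driver version, $2 = architecture (defaults to `uname -m`).
installer_name() {
    version="$1"
    arch="${2:-$(uname -m)}"
    case "$arch" in
        x86_64|aarch64)
            printf 'NVIDIA-Linux-%s-%s.run\n' "$arch" "$version"
            ;;
        *)
            # Anything else is rejected rather than guessed at.
            echo "unsupported architecture: $arch" >&2
            return 1
            ;;
    esac
}
```

Calling `installer_name "$DRIVER_VERSION"` on the target host would then pick the file matching the running kernel's architecture; the download base URL is left out here on purpose.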
Users have reported that in some cases the nvidia.service fails because
/opt/nvidia/current is a directory and the symbolic link gets created inside
it. I have no idea how we get there, but to make the service robust in the face
of this kind of issue:

- remove the directory if it exists
- use `-T` with ln to ensure that symbolic link creation fails if `current` is a directory

Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
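
The two steps in the commit above can be sketched roughly as follows. The helper name `link_current` and its arguments are hypothetical; the actual script in the PR may be structured differently. `-T` is the GNU coreutils `--no-target-directory` option, so this assumes GNU `ln`.

```shell
#!/bin/sh
# Hypothetical helper: point the symlink $2 at $1.
link_current() {
    target="$1"
    link="$2"
    # If a real directory (not a symlink to one) sits at the link path,
    # a plain `ln -s` would create the link inside it, so remove it first.
    if [ -d "$link" ] && [ ! -L "$link" ]; then
        rm -rf "$link"
    fi
    # -T: always treat $link as the link name itself, never as a directory
    #     to create the link in; ln fails if $link is still a directory.
    # -f: replace an existing symlink.
    ln -sfT "$target" "$link"
}
```

With this shape, a leftover directory is cleaned up before linking, and if the cleanup is ever skipped, `ln` reports an error instead of silently nesting the link.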
This saves space at runtime.

Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
Installers for 570 sometimes default to Open drivers, which we can't support
properly at this time. Force proprietary drivers. There are also additional
options that suppress certain worrisome error strings - enable those if
supported too.

Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
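
One way to "enable those if supported" is to check the installer's help text before passing an option. The sketch below is an assumption about how that could look, not the PR's implementation: `supported_flags` is a hypothetical helper, and the option name `--kernel-module-type=proprietary` used in the usage note is assumed from recent nvidia-installer releases and should be checked against the installer actually in use.

```shell
#!/bin/sh
# Hypothetical helper: read installer help text on stdin and print each
# candidate option (given as arguments) whose name appears in that text.
supported_flags() {
    help_text="$(cat)"
    for flag in "$@"; do
        # Strip any "=value" part so only the option name is matched.
        name="${flag%%=*}"
        case "$help_text" in
            *"$name"*) printf '%s\n' "$flag" ;;
        esac
    done
}
```

A possible use, assuming the installer prints its full option list with `--advanced-options`: `sh ./installer.run --advanced-options | supported_flags --kernel-module-type=proprietary`. The match is a plain substring test, so an option whose name is a prefix of another would also match; a stricter check may be needed in practice.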
The nspawn container runs in its own scope, which journal output is then
associated with. By passing `--keep-unit` we can guarantee that all log output
will stay associated with the nvidia.service and can be viewed by running
`journalctl -u nvidia.service`.

Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
So that we can pick up kmods contained in sysexts (like zfs) and generate
complete module dependency information. I thought we could skip running depmod
for nvidia drivers because we manually insmod them, but nvidia's GPU operator
driver validation expects to be able to run modprobe - so we have to generate
them.

Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
Member Author

jepio commented Mar 5, 2025

> Remove depmod generation. We didn't depend on this being done and this was broken because those parts of the filesystem are readonly in the nspawn container. See also: flatcar/Flatcar#1576

Turns out we need the depmod information for NVIDIA GPU operator, so I removed this commit.

github-actions bot commented Mar 5, 2025

Member Author

jepio commented Mar 5, 2025

I'm still iterating on testing this PR with this new kola test: flatcar/mantle#583

Member Author

jepio commented Mar 5, 2025

The R535 driver branch, which is LTS, does not compile on arm64 with GCC
14/kernel 6.6. Keep amd64 on R535 and switch arm64 to R570 by default.
R570 is the first driver version that I found that is currently
supported and works for arm64.

Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
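
The per-architecture default described in the commit above could be expressed along these lines. This is only a sketch: `default_driver_branch` is a hypothetical name, and it returns the branch numbers mentioned in the commit message (535 and 570), not full driver version strings.

```shell
#!/bin/sh
# Hypothetical helper: pick the default driver branch for an architecture.
# $1 = architecture (defaults to `uname -m`).
default_driver_branch() {
    arch="${1:-$(uname -m)}"
    case "$arch" in
        aarch64) echo 570 ;;  # R535 does not build on arm64 per the commit
        *)       echo 535 ;;  # other architectures (amd64) stay on the LTS branch
    esac
}
```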
3 participants