Can't install nvidia driver via nvidia.service #1283

Closed
tearfulDalvik opened this issue Dec 10, 2023 · 7 comments
Labels
kind/bug Something isn't working

Comments

@tearfulDalvik

Description

Can't install nvidia driver via nvidia.service

Environment and steps to reproduce

  1. Set-up:
    Flatcar Stable installed on VMware ESXi 8.0.2 via OVA import, then manually upgraded to Flatcar Beta
DISTRIB_ID="Flatcar Container Linux by Kinvolk"
DISTRIB_RELEASE=3760.1.0
DISTRIB_CODENAME="Oklo"
DISTRIB_DESCRIPTION="Flatcar Container Linux by Kinvolk 3760.1.0 (Oklo)"
NAME="Flatcar Container Linux by Kinvolk"
ID=flatcar
ID_LIKE=coreos
VERSION=3760.1.0
VERSION_ID=3760.1.0
BUILD_ID=2023-11-20-1827
SYSEXT_LEVEL=1.0
PRETTY_NAME="Flatcar Container Linux by Kinvolk 3760.1.0 (Oklo)"
ANSI_COLOR="38;5;75"
HOME_URL="https://flatcar.org/"
BUG_REPORT_URL="https://issues.flatcar.org"
FLATCAR_BOARD="amd64-usr"
CPE_NAME="cpe:2.3:o:flatcar-linux:flatcar_linux:3760.1.0:*:*:*:*:*:*:*"
  2. Task:
    journalctl -u nvidia -f
  3. Action(s):
    a. Assigned a P40-6Q GPU to the Flatcar VM
    b. journalctl -u nvidia -f
  4. Error:
Dec 10 09:09:15 flatcar-rke2-worker-2 systemd[1]: Starting nvidia.service - NVIDIA Configure Service...
Dec 10 09:09:15 flatcar-rke2-worker-2 setup-nvidia[18335]: Downloading Flatcar Container Linux Developer Container for version: 3760.1.0
Dec 10 09:09:16 flatcar-rke2-worker-2 setup-nvidia[18398]:   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
Dec 10 09:09:16 flatcar-rke2-worker-2 setup-nvidia[18398]:                                  Dload  Upload   Total   Spent    Left  Speed
Dec 10 09:09:30 flatcar-rke2-worker-2 setup-nvidia[18398]: [1.2K blob data]
Dec 10 09:09:41 flatcar-rke2-worker-2 setup-nvidia[18335]: Downloading NVIDIA 535.104.05 Driver
Dec 10 09:09:41 flatcar-rke2-worker-2 setup-nvidia[19275]:   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
Dec 10 09:09:41 flatcar-rke2-worker-2 setup-nvidia[19275]:                                  Dload  Upload   Total   Spent    Left  Speed
Dec 10 09:10:09 flatcar-rke2-worker-2 setup-nvidia[19275]: [2.3K blob data]
Dec 10 09:10:09 flatcar-rke2-worker-2 setup-nvidia[18335]: Extract the NVIDIA Driver Installer 535.104.05
Dec 10 09:10:09 flatcar-rke2-worker-2 setup-nvidia[18335]: /opt/nvidia/workdir/nvidia-workdir /
Dec 10 09:10:09 flatcar-rke2-worker-2 setup-nvidia[20411]: Creating directory NVIDIA-Linux-x86_64-535.104.05
Dec 10 09:10:09 flatcar-rke2-worker-2 setup-nvidia[20411]: Verifying archive integrity... OK
Dec 10 09:10:10 flatcar-rke2-worker-2 setup-nvidia[20411]: Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 535.104.05
Dec 10 09:10:12 flatcar-rke2-worker-2 setup-nvidia[20448]: ..................................................................................................................................>
Dec 10 09:10:12 flatcar-rke2-worker-2 setup-nvidia[18335]: /
Dec 10 09:10:12 flatcar-rke2-worker-2 setup-nvidia[18335]: Spawn system-nspawn container to install the NVIDIA drivers
Dec 10 09:10:12 flatcar-rke2-worker-2 sudo[20540]:     root : PWD=/ ; USER=root ; COMMAND=/usr/bin/systemd-nspawn --read-only --volatile=overlay --image=/opt/nvidia/workdir/flatcar_develope>
Dec 10 09:10:12 flatcar-rke2-worker-2 sudo[20540]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=0)
Dec 10 09:10:48 flatcar-rke2-worker-2 setup-nvidia[18335]: /opt/nvidia /
Dec 10 09:10:48 flatcar-rke2-worker-2 setup-nvidia[18335]: /
Dec 10 09:10:48 flatcar-rke2-worker-2 setup-nvidia[30179]: ldconfig: /lib/ld.so.conf is not an ELF file - it has the wrong magic bytes at the start.
Dec 10 09:10:48 flatcar-rke2-worker-2 setup-nvidia[18335]: /opt/nvidia/current/usr/lib/modules/6.1.62-flatcar/video /
Dec 10 09:10:48 flatcar-rke2-worker-2 setup-nvidia[30183]: insmod: ERROR: could not insert module nvidia.ko: No such device
Dec 10 09:10:48 flatcar-rke2-worker-2 systemd[1]: nvidia.service: Main process exited, code=exited, status=1/FAILURE
Dec 10 09:10:48 flatcar-rke2-worker-2 systemd[1]: nvidia.service: Failed with result 'exit-code'.
Dec 10 09:10:48 flatcar-rke2-worker-2 systemd[1]: Failed to start nvidia.service - NVIDIA Configure Service.
@jepio
Member

jepio commented Dec 11, 2023

Hi @tearfulDalvik,

Could you paste the last few lines of dmesg after running sudo /usr/lib/nvidia/bin/setup-nvidia? "No such device" suggests that your GPU might be unsupported by the default driver version, and you might want to explicitly select a different one.

@jepio
Member

jepio commented Dec 11, 2023

Checking NVIDIA's driver download page, you might want to try selecting NVIDIA driver version 440.95.01. See the instructions here: https://www.flatcar.org/docs/latest/setup/customization/using-nvidia/#customization
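
For readers following along, pinning a driver version could look roughly like the sketch below. It assumes the metadata file is /etc/flatcar/nvidia-metadata and the key is NVIDIA_DRIVER_VERSION, as the linked page describes; check the docs for your release before relying on either. The helper name set_nvidia_driver_version is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: record the driver version that setup-nvidia should install.
# Assumption: the default file path and the NVIDIA_DRIVER_VERSION key match the
# Flatcar documentation linked above; verify them for your release.
set_nvidia_driver_version() {
    version="$1"
    metadata_file="${2:-/etc/flatcar/nvidia-metadata}"
    if [ -z "$version" ]; then
        echo "usage: set_nvidia_driver_version VERSION [FILE]" >&2
        return 1
    fi
    # Create the parent directory if it does not exist yet.
    mkdir -p "$(dirname "$metadata_file")" || return 1
    # Overwrite the file with a single KEY=value line.
    printf 'NVIDIA_DRIVER_VERSION=%s\n' "$version" > "$metadata_file"
}
```

Run as root, `set_nvidia_driver_version 440.95.01` would write the version suggested above. The service presumably needs to be restarted afterwards so that setup-nvidia runs again with the new version.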

@tearfulDalvik
Author

tearfulDalvik commented Dec 11, 2023

Hello @jepio,

Thank you. It seems vGPUs aren't supported: they need NVIDIA GRID drivers instead of the regular Linux drivers.
Also, may I know how to undo the nvidia.service and setup-nvidia installation?

sudo /usr/lib/nvidia/bin/setup-nvidia
ldconfig: /lib/ld.so.conf is not an ELF file - it has the wrong magic bytes at the start.

/opt/nvidia/current/usr/lib/modules/6.1.62-flatcar/video /home/core
insmod: ERROR: could not insert module nvidia.ko: No such device

dmesg:

[130852.530069] nvidia 0000:02:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=none:owns=none
[130852.530546] NVRM: The NVIDIA GPU 0000:02:00.0 (PCI ID: 10de:1b38)
                NVRM: installed in this system is not supported by the
                NVRM: NVIDIA 535.104.05 driver release.
                NVRM: Please see 'Appendix A - Supported NVIDIA GPU Products'
                NVRM: in this release's README, available on the operating system
                NVRM: specific graphics driver download page at www.nvidia.com.
[130852.532087] nvidia: probe of 0000:02:00.0 failed with error -1
[130852.532299] NVRM: The NVIDIA probe routine failed for 1 device(s).
[130852.532499] NVRM: None of the NVIDIA devices were initialized.
[130852.532957] nvidia-nvlink: Unregistered Nvlink Core, major device number 246
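
The PCI ID in the NVRM message is the value to look up in the driver README's supported-products appendix. A small sketch for pulling it out of the kernel log, assuming the message format matches the output above (the function name unsupported_gpu_ids is illustrative):

```shell
#!/bin/sh
# Print the PCI ID(s) of GPUs the NVIDIA kernel module reported as unsupported.
# Reads kernel log text on stdin.
# Assumption: the NVRM message format matches the dmesg output shown above.
unsupported_gpu_ids() {
    grep -oE 'NVRM: The NVIDIA GPU [0-9a-fA-F:.]+ \(PCI ID: [0-9a-fA-F]{4}:[0-9a-fA-F]{4}\)' \
        | sed -E 's/.*\(PCI ID: ([0-9a-fA-F]{4}:[0-9a-fA-F]{4})\)/\1/' \
        | sort -u
}
```

With the log above, `sudo dmesg | unsupported_gpu_ids` prints 10de:1b38.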

@jepio
Member

jepio commented Dec 12, 2023

I don't think we ever investigated GRID/vGPU drivers, as those require licensing.

To undo, you can remove /opt/nvidia and run systemctl mask --now nvidia.service.
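
Spelled out as commands, the undo could look like the sketch below. The DRY_RUN switch and the function names are illustrative additions, not part of Flatcar; the two commands themselves are the ones named above, with the unit masked first so it does not start again while the directory is being removed.

```shell
#!/bin/sh
# Run a command, or just print it when DRY_RUN=1 (illustrative helper).
run_cmd() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Sketch of the undo steps described above; needs root when not in dry-run mode.
undo_nvidia_setup() {
    # Stop the unit now and prevent it from being started again.
    run_cmd systemctl mask --now nvidia.service || return 1
    # Remove everything setup-nvidia placed under /opt/nvidia.
    run_cmd rm -rf /opt/nvidia
}
```

`DRY_RUN=1 undo_nvidia_setup` only prints the two commands, which makes it easy to review them before running the function as root.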

@tearfulDalvik
Author

Totally understandable. Thank you very much.

@sayanchowdhury
Member

One closing query: Did you manually initiate the nvidia.service, or did it trigger automatically?

@tearfulDalvik
Author

> One closing query: Did you manually initiate the nvidia.service, or did it trigger automatically?

Hello,
It was triggered automatically.
