ublue-nvctk-cdi.service should be refactored to udev rule #180

Open
m2Giles opened this issue Dec 14, 2023 · 6 comments
Labels
bug Something isn't working

Comments

@m2Giles
Member

m2Giles commented Dec 14, 2023

In the Nvidia images, we have the ublue-nvctk-cdi.service to support containers.

The only conditions on this service are that the binary exists and is executable, and that it runs after local-fs.target. This is problematic because it always runs, even when the Nvidia modules are not loaded because no Nvidia card is present. With eGPUs, the Nvidia card is not present until much later in the boot process. Because the script depends on the necessary hardware being present, it should be triggered by a udev rule instead of a service. Right now with an eGPU, you have to manually restart the service before entering any containers.

I'll try converting the service to a udev rule to test.
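
As an illustration of the gap described above, here is a minimal sketch of a guard that only runs a command when the nvidia kernel module is loaded. It is not the service's actual script. The function names and the `SYSFS_MODULE_ROOT` override are hypothetical and exist only so the check can be exercised without real hardware.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the nvidia kernel module is loaded.
# A loaded module shows up as a directory under /sys/module.
# SYSFS_MODULE_ROOT is an assumed override for testing, not a real setting.
nvidia_module_loaded() {
    [ -d "${SYSFS_MODULE_ROOT:-/sys/module}/nvidia" ]
}

# Run the given command only when the module is loaded.
# Otherwise print a note to stderr and return success without running it.
run_if_nvidia_loaded() {
    if nvidia_module_loaded; then
        "$@"
    else
        echo "nvidia module not loaded; skipping: $*" >&2
        return 0
    fi
}
```

A call such as `run_if_nvidia_loaded nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml` would then be a no-op on machines without a loaded driver. This does not solve the eGPU case, where the hardware appears after the check has already run.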

@bsherman
Contributor

A related concern was reported in Discord ( https://discord.com/channels/1072614816579063828/1072617059265032342/1232829046036103231 ) where if the nvidia GPU has been disabled (for example, BIOS disabled dGPU on a dual GPU laptop), then this fails erroneously.

I should finally fix this bug.

@bsherman bsherman self-assigned this Apr 25, 2024
@bsherman bsherman added the bug Something isn't working label Apr 25, 2024
@bsherman bsherman moved this to Todo in Project Goals Apr 25, 2024
@m2Giles
Member Author

m2Giles commented Apr 25, 2024

This will also fail if the nvidia card isn't "ready". We've seen internal A4000 also throw this error.

@Sharkitty

Hello! I'm the user mentioned by @bsherman
The system this happened on is running a custom image based on ublue-kinoite-nvidia image (No nvidia related change applied downstream of ublue, only surface stuff so far). As described, the dGPU is disabled in BIOS when this happens, no error in Hybrid mode. This is the systemd log of the failed service:

× ublue-nvctk-cdi.service - ublue nvidia container toolkit CDI auto-generation
     Loaded: loaded (/usr/lib/systemd/system/ublue-nvctk-cdi.service; enabled; preset: enabled)
    Drop-In: /usr/lib/systemd/system/service.d
             └─10-timeout-abort.conf
     Active: failed (Result: exit-code) since Thu 2024-04-25 16:49:45 CEST; 2h 27min ago
   Main PID: 5074 (code=exited, status=1/FAILURE)
        CPU: 28ms

Apr 25 16:49:45 fedora systemd[1]: Starting ublue-nvctk-cdi.service - ublue nvidia container toolkit CDI auto-generation...
Apr 25 16:49:45 fedora nvidia-ctk[5074]: time="2024-04-25T16:49:45+02:00" level=info msg="Auto-detected mode as \"nvml\""
Apr 25 16:49:45 fedora nvidia-ctk[5074]: time="2024-04-25T16:49:45+02:00" level=error msg="failed to generate CDI spec: failed to create device CDI specs: failed to initialize NVML: ERROR_DRIVER_NOT_LOADED"
Apr 25 16:49:45 fedora systemd[1]: ublue-nvctk-cdi.service: Main process exited, code=exited, status=1/FAILURE
Apr 25 16:49:45 fedora systemd[1]: ublue-nvctk-cdi.service: Failed with result 'exit-code'.
Apr 25 16:49:45 fedora systemd[1]: Failed to start ublue-nvctk-cdi.service - ublue nvidia container toolkit CDI auto-generation.

And here is what journalctl -xeu returns for this service:

Apr 25 16:49:45 fedora systemd[1]: Starting ublue-nvctk-cdi.service - ublue nvidia container toolkit CDI auto-generation...
░░ Subject: A start job for unit ublue-nvctk-cdi.service has begun execution
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░
░░ A start job for unit ublue-nvctk-cdi.service has begun execution.
░░
░░ The job identifier is 331.
Apr 25 16:49:45 fedora nvidia-ctk[5074]: time="2024-04-25T16:49:45+02:00" level=info msg="Auto-detected mode as \"nvml\""
Apr 25 16:49:45 fedora nvidia-ctk[5074]: time="2024-04-25T16:49:45+02:00" level=error msg="failed to generate CDI spec: failed to create device CDI specs: failed to initialize NVML: ERROR_DRIVER_NOT_LOADED"
Apr 25 16:49:45 fedora systemd[1]: ublue-nvctk-cdi.service: Main process exited, code=exited, status=1/FAILURE
░░ Subject: Unit process exited
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░
░░ An ExecStart= process belonging to unit ublue-nvctk-cdi.service has exited.
░░
░░ The process' exit code is 'exited' and its exit status is 1.
Apr 25 16:49:45 fedora systemd[1]: ublue-nvctk-cdi.service: Failed with result 'exit-code'.
░░ Subject: Unit failed
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░
░░ The unit ublue-nvctk-cdi.service has entered the 'failed' state with result 'exit-code'.
Apr 25 16:49:45 fedora systemd[1]: Failed to start ublue-nvctk-cdi.service - ublue nvidia container toolkit CDI auto-generation.
░░ Subject: A start job for unit ublue-nvctk-cdi.service has failed
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░
░░ A start job for unit ublue-nvctk-cdi.service has finished with a failure.
░░
░░ The job identifier is 331 and the job result is failed.

As I mentioned on Discord, I think disabling the dGPU shouldn't be a source of error, as this has a HUGE impact on battery life, and if I don't plan on doing something that requires the dGPU, I think it's best to just disable it until I need it. In this case, I think displaying warnings at most would be ideal.
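
One way to get the "warnings at most" behaviour is a small wrapper that reports a failing command but still exits successfully. This is a sketch under that assumption, not what the ublue images ship; `warn_on_failure` is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical wrapper: run a command and downgrade any failure to a warning.
# The wrapper itself always returns 0, so a caller such as systemd would not
# mark the unit as failed.
warn_on_failure() {
    # $? inside the message is the exit status of the wrapped command.
    "$@" || echo "warning: '$*' exited with status $?; continuing" >&2
    return 0
}
```

For example, `warn_on_failure nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml` would log the NVML error but not fail. systemd can express something similar natively by prefixing the `ExecStart=` command with `-`, which may be the simpler option if the unit is kept.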

@bsherman
Contributor

@m2Giles and I were discussing this, and we can replace the service with a udev rule which runs when the device gets added.

ACTION=="add", SUBSYSTEM=="pci", ATTR{vendor}=="0x10de", ATTR{class}=="0x030000", RUN{program}="/usr/bin/nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml"

Something like this?
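
To see which devices a rule like the one above would match on a given machine, the same vendor and class tests can be reproduced against sysfs. This is a rough sketch: `find_nvidia_display_devices` and the `PCI_DEVICES_ROOT` override are hypothetical names used for illustration and testing.

```shell
#!/bin/sh
# Hypothetical helper: print the PCI addresses whose sysfs attributes match
# the rule's ATTR{vendor}=="0x10de" and ATTR{class}=="0x030000" tests.
# PCI_DEVICES_ROOT is an assumed override for testing, not a real setting.
find_nvidia_display_devices() {
    root="${PCI_DEVICES_ROOT:-/sys/bus/pci/devices}"
    for dev in "$root"/*; do
        # Skip entries that lack either attribute file.
        [ -r "$dev/vendor" ] && [ -r "$dev/class" ] || continue
        vendor=$(cat "$dev/vendor")
        class=$(cat "$dev/class")
        if [ "$vendor" = "0x10de" ] && [ "$class" = "0x030000" ]; then
            # Print only the PCI address, e.g. 0000:01:00.0
            echo "${dev##*/}"
        fi
    done
}
```

Two caveats worth checking before relying on the rule: some hybrid laptops appear to expose the dGPU with class 0x030200 (3D controller) instead of 0x030000, which an exact match would skip, and udev's `RUN` is meant for short-running commands, so the driver may not be ready yet when the PCI "add" event fires.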

@bsherman bsherman removed their assignment Oct 31, 2024
@bsherman bsherman changed the title from "ublue-nvctk-cdi.service runs always" to "ublue-nvctk-cdi.service should be refactored to udev rule" Oct 31, 2024
@Schmuuu

Schmuuu commented Jan 23, 2025

Is there an ETA for when this change will be visible in the ublue images? I convinced a family member to switch from Win10 to Aurora, but after installation on his old laptop with an Nvidia GPU he only gets the service error message displayed on his screen and no desktop loads:
[FAILED] Failed to start ublue-nvctk-cdi.se...tainer toolkit CDI auto-generation
which seems to be the error message
Failed to start ublue-nvctk-cdi.service - ublue nvidia container toolkit CDI auto-generation.
just truncated to 80 characters.

I guess there is no fix that one could apply manually, right? It needs to be an updated Aurora image, correct?

@Sharkitty

From what I see, while your issue seems related to some extent, I think something else is going on in your case. This error should not prevent the desktop from displaying, as it occurs in two cases:

  • PC with hybrid graphics, where the dGPU is disabled. In this case the iGPU is still functional.
  • Use of an eGPU, which as described earlier is not available when the service starts. There should be another GPU to render the desktop in this case as well.

Projects
Status: Todo
Development

No branches or pull requests

4 participants