clpeak and llama.cpp stuck at 100% CPU on 6.8.5 kernel #726
Comments
There's an i915 kernel driver regression on the CCS (Compute Command Streamer) engine of DG2/Arc GPUs. https://gitlab.freedesktop.org/drm/intel/-/issues/10895 The faulty kernel commit in 6.6.26+ and 6.8.5+ is https://lore.kernel.org/stable/20240327155622.538140-4-andi.shyti@linux.intel.com/T/ Both Media-Driver and OpenCL are affected. @JablonskiMateusz Could you please make the kernel developers aware of this regression? |
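For anyone wanting to check a machine quickly, here is a minimal sketch that tests whether a kernel release string falls in the ranges named in the comment above (6.6.26+ and 6.8.5+). The names `FIRST_AFFECTED`, `parse_release` and `has_faulty_commit` are made up for this example. It only encodes the two ranges from this thread and knows nothing about later point releases that carry a fix, so a `True` result is a hint, not a diagnosis.

```python
import re

# Stable series -> first point release reported in this thread to contain
# the faulty commit. Assumption: only these two series are tracked here.
FIRST_AFFECTED = {(6, 6): 26, (6, 8): 5}


def parse_release(release):
    """Extract (major, minor, patch) from a string such as '6.8.5-arch1-1'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    major, minor, patch = match.groups()
    # A release like '6.10-rc2' has no patch component; treat it as 0.
    return int(major), int(minor), int(patch or 0)


def has_faulty_commit(release):
    """Return True if the release lies in one of the ranges named in this thread."""
    major, minor, patch = parse_release(release)
    first = FIRST_AFFECTED.get((major, minor))
    return first is not None and patch >= first
```

On Linux the running kernel's release string can be obtained with `platform.release()` and passed to `has_faulty_commit`.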
This one seems to happen with any kernel that has the Spectre BHI fix in it. It happens with LTS kernels as well. Kernels with this issue:
Kernels that work fine:
Haven't tried kernels older than 6.6 LTS.
|
To determine if the same issue caused this regression, you can try my custom Arch Linux kernel package with the CCS changes reverted: https://github.com/gnattu/linux/releases/tag/6.8.7-jelly If this kernel fixes the issue, then it's likely related to the CCS change as well. |
That custom kernel works fine. So it is CCS related. |
An Intel developer has provided me with a series of patches to test the theory of a potential fix without a full revert. I have created an Arch Linux package for it, with the related patches attached: https://github.com/gnattu/linux/releases/tag/6.8.7-intel-ccs_mode-4 Please be aware that this kernel is not a guaranteed fix and is only for testing purposes. If anyone has the time, please try this kernel and let me know if it fixes the issue or not. |
This one doesn't work, but htop shows "gray / io-wait" CPU usage instead of "red / kernel" CPU usage. |
Just got the latest patch from Intel developers: https://github.com/gnattu/linux/releases/tag/6.8.7-intel-set-ccs-mode-on-reset Intel developers already reproduced our issue with clpeak, so this kernel has a higher chance of fixing our issue. |
This works but performance is 1/4 of what it should be. |
I can confirm that the applications no longer hang with a kernel built with the new patches on drm-intel-gt-next, but compute performance seems to be lower on my system as well. I'm not sure whether something is going on between the compute runtime and these new changes, or whether that is out of scope for this discussion. |
Can you try this kernel build to see if the performance improves? https://github.com/gnattu/linux/releases/tag/6.9.1-jelly |
This works fine and performance is pretty much the same as 6.8.4. 6.8.4 / 6.9.1-custom: |
Can confirm that this is an improvement. I don't know how to run a benchmark to check memcopy, but comparing against a pretty intense Blender file I have lying around, there's a noticeable improvement in the viewport. |
Now, with kernel 6.8.10 in Fedora 40, it works, though performance is not great. I'm closing this. Thanks, everyone! |
@notsyncing Full performance needs the patch mentioned above, "drm/i915/gt: Fix CCS id's calculation for CCS mode setting", but it was only merged in kernel 6.10-rc2, so it will take a while to reach Fedora. |
I'm on kernel 6.9 and performance is back up to pre 6.8 levels. Would love even more performance, of course, but I'll be surprised if there is another bump after 6.10. |
I don't see any change with kernel 6.10 on OpenCL-Benchmark, FWIW. |
As in #710, where @Disty0 writes:
This happens to me as well on both llama.cpp and clpeak.
clpeak output:
Then it gets stuck here, and the clpeak process consumes one CPU core (100% usage).
perf record -a, run while it is stuck, reports:
System information:
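The symptom described above (clpeak never returning while pinning one core) can be detected from a script by running the command with a time limit. This is a minimal sketch; `finishes_within` is a hypothetical helper and the timeout value is an arbitrary choice, not a figure from this issue. Note that a process spinning inside the kernel may not respond to being killed, in which case this helper could itself block.

```python
import subprocess


def finishes_within(cmd, timeout_s):
    """Run cmd; return True if it exits within timeout_s seconds.

    On timeout, subprocess.run kills the child, waits for it, and raises
    TimeoutExpired, which is reported here as False.
    """
    try:
        subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
        return True
    except subprocess.TimeoutExpired:
        return False
```

For example, `finishes_within(["clpeak"], 120)` would return `False` on a kernel where clpeak hangs, assuming a healthy run completes in under two minutes on the machine in question.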