higher CV_PAUSE cost on skylake #22852
Comments
Hi! I am a contributor. From what I read, my Tiger Lake should be affected too? If so, I should see a considerable performance improvement on multi-threaded code (with sufficient lock contention) if I change it to:
Also, could you provide a minimal program (reproducer) that exhibits that behaviour?
Anything after Skylake should be affected, from what I have found. A reproducer is tough: the cost of computation in a thread is usually higher than the spin cost (unless you have more than 100 cores), and if you want to benchmark with callgrind, it uses SIMD emulation, which might not work with _mm_pause. I could only find that "bug" thanks to private tooling and scale.
I changed it. Anyway, cool find!
I understand your argument about scale, but I would still like to see the difference demonstrated somehow. By the way, why would you profile with callgrind?
Regarding callgrind: it is just the CPU profiler I usually use for open source work.
@vrabaud Feel free to prepare a PR (validated on your side).
I'll get back to you with a pull request once I can validate the gain. For now, I am testing CV_PAUSE as follows:
This is fixing opencv#22852
System Information
OpenCV version: 4.6.0
Operating System / Platform: Custom Linux
Compiler & compiler version: Custom clang
Detailed description
On Intel architectures, CV_PAUSE is implemented with _mm_pause:
opencv/modules/core/src/parallel_impl.cpp, line 47 in 6ca205a
But it is called with the same number of loop iterations regardless of the architecture:
opencv/modules/core/src/parallel_impl.cpp, line 393 in 6ca205a
And the cost of _mm_pause went from 5 micro-ops on Haswell to 140 on Skylake, so the thread pool consumes noticeably more CPU on Skylake. This is documented (along with a workaround) here: https://www.intel.com/content/www/us/en/developer/articles/technical/a-common-construct-to-avoid-the-contention-of-threads-architecture-agnostic-spin-wait-loops.html
Steps to reproduce
Profiling any multi-threaded code on Haswell and then Skylake.