Skip to content

higher CV_PAUSE cost on skylake #22852

New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Closed
3 of 4 tasks
vrabaud opened this issue Nov 23, 2022 · 8 comments
Closed
3 of 4 tasks

higher CV_PAUSE cost on skylake #22852

vrabaud opened this issue Nov 23, 2022 · 8 comments

Comments

@vrabaud
Copy link
Contributor

vrabaud commented Nov 23, 2022

System Information

OpenCV version: 4.6.0
Operating System / Platform: Custom Linux
Compiler & compiler version: Custom clang

Detailed description

On Intel architectures, CV_PAUSE is implemented with __mm_pause:

# define CV_PAUSE(v) do { for (int __delay = (v); __delay > 0; --__delay) { _mm_pause(); } } while (0)

But it is called with the same number of loops independently from the architecture:

And the cost of __mm_pause went from 5 micro-ops on Haswell to 140 on Skylake thus creating more CPU consumption from the Threadpool on Skylake.

This is documented (as well as a workaround) here: https://www.intel.com/content/www/us/en/developer/articles/technical/a-common-construct-to-avoid-the-contention-of-threads-architecture-agnostic-spin-wait-loops.html

Steps to reproduce

Profiling any multi-threaded code on Haswell and then Skylake.

Issue submission checklist

  • I report the issue, it's not a question
  • I checked the problem with documentation, FAQ, open issues, forum.opencv.org, Stack Overflow, etc and have not found any solution
  • I updated to the latest OpenCV version and the issue is still there
  • There is reproducer code and related data files (videos, images, onnx, etc)
@vrabaud vrabaud added the bug label Nov 23, 2022
@kallaballa
Copy link
Contributor

Hi! I am a contributor. From what I read my Tigerlake should be affected too? If so, I should see a considerable performance improvement on multi-threaded code (with sufficient lock contention) if i change it to CV_PAUSE(1);?

@kallaballa
Copy link
Contributor

Also, could you provide a minimal program (reproducer) that exhibits that behaviour?

@vrabaud
Copy link
Contributor Author

vrabaud commented Nov 23, 2022

Anything after Skylake should be affected from what I have found.

A reproducer is tough: the cost of computation in a thread is usually higher (except if you have more than 100 cores) and if you want to benchmark with callgrind, it will use SIMD emulation which might not work with __mm_pause.

I could only find that "bug" thanks to private tooling and scale.

@kallaballa
Copy link
Contributor

kallaballa commented Nov 23, 2022

I changed CV_PAUSE(16); to CV_PAUSE(1); but fail to produce a program that shows signs of improvement. At the moment I simply measure execution times (actually FPS in my case) but also profiling with perf (through hotspot) doesn't show a significant difference.
Still investigating.

Anyway, cool find!

@kallaballa
Copy link
Contributor

Anything after Skylake should be affected from what I have found.

A reproducer is tough: the cost of computation in a thread is usually higher (except if you have more than 100 cores) and if you want to benchmark with callgrind, it will use SIMD emulation which might not work with __mm_pause.

I could only find that "bug" thanks to private tooling and scale.

I understand your argument of scale but still I would like to put the difference to effect somehow. btw. why would you profile with callgrind?

@vrabaud
Copy link
Contributor Author

vrabaud commented Nov 24, 2022

For callgrind, I was just suggesting the CPU profiler I usually use for open source work.

@alalek
Copy link
Member

alalek commented Nov 24, 2022

@vrabaud Feel free to prepare PR (which is validated on your side).

@vrabaud
Copy link
Contributor Author

vrabaud commented Nov 24, 2022

I'll get back to you with a pull request once I can validate the gain. For now, I am testing CV_PAUSE as follows:

// The delay fits the old behavior where __mm_pause took 5 cycles.
#   define CV_PAUSE(v) do { const uint64_t __delay = 5 * v; uint64_t __init = __rdtsc(); do { _mm_pause(); } while ((__rdtsc() - __init) < __delay); } while (0)

vrabaud added a commit to vrabaud/opencv that referenced this issue Dec 15, 2022
vrabaud added a commit to vrabaud/opencv that referenced this issue Dec 15, 2022
@vrabaud vrabaud closed this as completed Dec 16, 2022
# for free to join this conversation on GitHub. Already have an account? # to comment
Projects
None yet
Development

No branches or pull requests

3 participants