Try to use the heuristic's block configuration when using grid-stride kernels. #372

Merged (1 commit) on Aug 13, 2021

@maleadt (Member) commented Aug 12, 2021

#367 resulted in a broadcast performance regression with specially-sized inputs, as noticed by @crstnbr in https://gist.github.com/crstnbr/6cbc3d2c32422472a579bf61b5d62015. It turns out the 'slow' cases launched 49 blocks, while the 'fast' case only needed 48 (we're dealing with a grid-stride kernel here, so the block count isn't tied to the input size):

heuristic = (blocks = 48, threads = 1024)

config = (threads = 1024, blocks = 49, elements_per_thread = 170)
saxpy (julia): n=      8388608  50.855 GFLOP/s 305.127 GB/s   0.000 s

config = (threads = 1024, blocks = 48, elements_per_thread = 192)
saxpy (julia): n=      9437184  61.111 GFLOP/s 366.663 GB/s   0.000 s

config = (threads = 1024, blocks = 49, elements_per_thread = 213)
saxpy (julia): n=     10485760  50.886 GFLOP/s 305.318 GB/s   0.000 s

This PR corrects that by trying to stick to the block count suggested by the heuristic, and padding the number of elements per thread rather than the number of blocks:

config = (threads = 1024, blocks = 48, elements_per_thread = 171)
saxpy (julia): n=      8388608  60.902 GFLOP/s 365.415 GB/s   0.000 s

config = (threads = 1024, blocks = 48, elements_per_thread = 192)
saxpy (julia): n=      9437184  61.091 GFLOP/s 366.545 GB/s   0.000 s

config = (threads = 1024, blocks = 48, elements_per_thread = 214)
saxpy (julia): n=     10485760  61.151 GFLOP/s 366.907 GB/s   0.000 s

Notice the 171 elements per thread instead of 170 for the first input size (and 214 instead of 213 for the third).
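For reference, the two sets of configurations above can be reproduced with a bit of integer arithmetic. This is a plain-Python sketch and not the code from this PR: the function names and the exact rounding rules are assumptions, chosen because they match the printed `config` tuples (rounding elements per thread down gives the old configs, rounding up gives the new ones).

```python
def cld(a, b):
    """Ceiling division on positive integers."""
    return -(-a // b)

def config_before(n, threads, heuristic_blocks):
    # Assumed old behaviour: round elements per thread DOWN,
    # then launch as many blocks as needed to cover the input.
    elements = max(1, n // (threads * heuristic_blocks))
    blocks = cld(n, threads * elements)
    return {"threads": threads, "blocks": blocks, "elements_per_thread": elements}

def config_after(n, threads, heuristic_blocks):
    # Assumed new behaviour: round elements per thread UP,
    # so the block count never exceeds the heuristic's suggestion.
    elements = cld(n, threads * heuristic_blocks)
    blocks = cld(n, threads * elements)
    return {"threads": threads, "blocks": blocks, "elements_per_thread": elements}

if __name__ == "__main__":
    for n in (8388608, 9437184, 10485760):
        print(n, config_before(n, 1024, 48), config_after(n, 1024, 48))
```

With `threads = 1024` and a heuristic of 48 blocks, the first and third sizes come out as 49 blocks before and 48 blocks after, while the middle size divides evenly and is unchanged.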

Curiously, removing the per-thread iteration altogether speeds up everything even further:

heuristic = (blocks = 48, threads = 1024)

config = (threads = 1024, blocks = 8192, elements_per_thread = 1)
saxpy (julia): n=      8388608  61.532 GFLOP/s 369.194 GB/s   0.000 s

config = (threads = 1024, blocks = 9216, elements_per_thread = 1)
saxpy (julia): n=      9437184  61.714 GFLOP/s 370.284 GB/s   0.000 s

config = (threads = 1024, blocks = 10240, elements_per_thread = 1)
saxpy (julia): n=     10485760  61.964 GFLOP/s 371.785 GB/s   0.000 s
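The block counts in this last experiment are simply the input length divided by the thread count, rounded up. A minimal Python sketch of that arithmetic (the helper name is hypothetical, not something from the code base):

```python
def blocks_one_element_per_thread(n, threads):
    # One element per thread: enough blocks to cover all n elements,
    # regardless of what the occupancy heuristic suggests.
    return -(-n // threads)  # ceiling division

if __name__ == "__main__":
    for n in (8388608, 9437184, 10485760):
        print(n, blocks_one_element_per_thread(n, 1024))
```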

I'm assuming that the 48/49 slowdown was due to unbalanced SM load, but it's weird that a totally unbounded block count performs better than the occupancy API-suggested number of 48 blocks... Since we might be introducing more grid-stride kernels, I'm inclined to leave it in for now.
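One rough way to look at the load-balance guess is to count "waves" of blocks. The Python sketch below assumes that exactly 48 blocks can be resident on the device at once (taken from the heuristic; the real scheduler is more dynamic than this) and that all blocks do equal work. It is a crude model and does not try to reproduce the measured numbers.

```python
def waves(blocks, resident_blocks):
    # Number of rounds needed if only `resident_blocks` can run at a time.
    return -(-blocks // resident_blocks)  # ceiling division

def slot_utilization(blocks, resident_blocks):
    # Fraction of available block slots that do useful work across all waves.
    return blocks / (waves(blocks, resident_blocks) * resident_blocks)

if __name__ == "__main__":
    for blocks in (48, 49, 8192):
        print(blocks, waves(blocks, 48), round(slot_utilization(blocks, 48), 3))
```

Under these assumptions, 49 blocks need a second, almost empty wave (about half the slots idle), whereas with 8192 blocks the partially filled last wave is a tiny fraction of the total. That would be consistent with the unbounded launch not suffering from the same imbalance, but it says nothing about why it beats the 48-block configuration.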

@maleadt maleadt merged commit 5d9a108 into master Aug 13, 2021
@bors bors bot deleted the tb/block_heuristic branch August 13, 2021 12:01