Try to use the heuristic's block configuration when using grid-stride kernels. #372

maleadt · 2021-08-12T10:58:02Z

#367 introduced a broadcast performance regression for certain input sizes, as noticed by @crstnbr in https://gist.github.com/crstnbr/6cbc3d2c32422472a579bf61b5d62015. It turns out the 'slow' cases launched 49 blocks, whereas the 'fast' case needed only 48 (we're dealing with a grid-stride kernel here):

heuristic = (blocks = 48, threads = 1024)

config = (threads = 1024, blocks = 49, elements_per_thread = 170)
saxpy (julia): n=      8388608  50.855 GFLOP/s 305.127 GB/s   0.000 s

config = (threads = 1024, blocks = 48, elements_per_thread = 192)
saxpy (julia): n=      9437184  61.111 GFLOP/s 366.663 GB/s   0.000 s

config = (threads = 1024, blocks = 49, elements_per_thread = 213)
saxpy (julia): n=     10485760  50.886 GFLOP/s 305.318 GB/s   0.000 s

This PR corrects that by trying to stick to the block count suggested by the heuristic, rounding up the number of elements per thread instead of the number of blocks:

config = (threads = 1024, blocks = 48, elements_per_thread = 171)
saxpy (julia): n=      8388608  60.902 GFLOP/s 365.415 GB/s   0.000 s

config = (threads = 1024, blocks = 48, elements_per_thread = 192)
saxpy (julia): n=      9437184  61.091 GFLOP/s 366.545 GB/s   0.000 s

config = (threads = 1024, blocks = 48, elements_per_thread = 214)
saxpy (julia): n=     10485760  61.151 GFLOP/s 366.907 GB/s   0.000 s

Notice the 171 elements per thread instead of 170 for the first input size.
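The arithmetic behind the two sets of configurations can be sketched as follows. This is a minimal Python sketch, not the code from this PR: the function names are hypothetical, and the rounding rules are only inferred from the logged `config` values (rounding the per-thread element count down and then deriving the block count, versus rounding it up against the heuristic's block count).

```python
def ceil_div(a, b):
    """Integer division rounded up."""
    return -(-a // b)


def config_round_down(n, threads, heuristic_blocks):
    """Assumed 'before' behaviour: round elements per thread down,
    then launch however many blocks are needed to cover n."""
    elements_per_thread = max(1, n // (threads * heuristic_blocks))
    blocks = ceil_div(n, threads * elements_per_thread)
    return threads, blocks, elements_per_thread


def config_pad_elements(n, threads, heuristic_blocks):
    """Assumed 'after' behaviour: round elements per thread up,
    so the block count never exceeds the heuristic's suggestion."""
    elements_per_thread = ceil_div(n, threads * heuristic_blocks)
    blocks = min(heuristic_blocks, ceil_div(n, threads * elements_per_thread))
    return threads, blocks, elements_per_thread


if __name__ == "__main__":
    # Input sizes and heuristic taken from the logs above.
    for n in (8388608, 9437184, 10485760):
        print(n, config_round_down(n, 1024, 48), config_pad_elements(n, 1024, 48))
```

With the heuristic of 48 blocks and 1024 threads, rounding down yields 49, 48 and 49 blocks for the three input sizes, while rounding up keeps all three at 48 blocks with 171, 192 and 214 elements per thread, matching the logs.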

Curiously, removing the per-thread iteration altogether speeds up everything even further:

heuristic = (blocks = 48, threads = 1024)

config = (threads = 1024, blocks = 8192, elements_per_thread = 1)
saxpy (julia): n=      8388608  61.532 GFLOP/s 369.194 GB/s   0.000 s

config = (threads = 1024, blocks = 9216, elements_per_thread = 1)
saxpy (julia): n=      9437184  61.714 GFLOP/s 370.284 GB/s   0.000 s

config = (threads = 1024, blocks = 10240, elements_per_thread = 1)
saxpy (julia): n=     10485760  61.964 GFLOP/s 371.785 GB/s   0.000 s

I'm assuming that the 48/49 slowdown was due to unbalanced SM load, but it's weird that a totally unbounded block count performs better than the occupancy API-suggested number of 48 blocks... Since we might be introducing more grid-stride kernels, I'm inclined to leave it in for now.
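To make the load-imbalance guess concrete, here is a deliberately crude Python sketch. It assumes (this is not confirmed anywhere above) that exactly 48 blocks can be resident at once, that blocks run in full "waves", and that every block takes the same time. The function name is hypothetical. The model only illustrates the direction of the effect; it overstates the measured slowdown and says nothing about why the unbounded configuration comes out slightly ahead.

```python
def ceil_div(a, b):
    """Integer division rounded up."""
    return -(-a // b)


def wave_utilization(blocks, concurrent_blocks):
    """Fraction of block slots doing useful work, under the assumption
    that blocks execute in waves of `concurrent_blocks` at a time."""
    waves = ceil_div(blocks, concurrent_blocks)
    return blocks / (waves * concurrent_blocks)


if __name__ == "__main__":
    # 48 is the heuristic's block count; treating it as the number of
    # concurrently resident blocks is an assumption of this sketch.
    for blocks in (48, 49, 8192):
        print(blocks, round(wave_utilization(blocks, 48), 3))
```

Under these assumptions, 49 blocks need a second wave that is almost entirely idle, whereas thousands of small blocks leave only a partially filled final wave out of many, so the imbalance is amortized.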

Commit 20f1a38: Try to use the heuristic's block configuration when using grid-stride kernels.

maleadt merged commit 5d9a108 into master Aug 13, 2021

bors bot deleted the tb/block_heuristic branch August 13, 2021 12:01
