Try to use the heuristic's block configuration when using grid-stride kernels. #372
#367 resulted in a broadcast performance regression with specially-sized inputs, as noticed by @crstnbr in https://gist.github.com/crstnbr/6cbc3d2c32422472a579bf61b5d62015. Turns out these 'slow' cases were due to launching 49 blocks, while the 'fast' case only needed 48 (since we're dealing with a grid-stride kernel here):
This PR corrects that by trying to stick to the block count as suggested by the heuristic, and padding the number of elements instead of vice versa:
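The difference between the two padding strategies can be sketched with plain integer arithmetic. This is only an illustration, not the code from this PR: the function names, the input size of 2,100,000 elements, and the 256 threads per block are hypothetical values chosen so the numbers line up with the 48/49 blocks and 170/171 elements mentioned here.

```python
def cld(a: int, b: int) -> int:
    """Ceiling division for positive integers (like Julia's `cld`)."""
    return -(-a // b)


def pad_blocks(n: int, threads: int, heuristic_blocks: int) -> tuple[int, int]:
    """Assumed earlier behaviour: fix the elements per thread, round the block count up."""
    elements_per_thread = max(1, n // (threads * heuristic_blocks))
    blocks = cld(n, threads * elements_per_thread)
    return blocks, elements_per_thread


def pad_elements(n: int, threads: int, heuristic_blocks: int) -> tuple[int, int]:
    """Keep the heuristic's block count, round the elements per thread up."""
    blocks = min(heuristic_blocks, cld(n, threads))  # never launch more blocks than needed
    elements_per_thread = cld(n, threads * blocks)
    return blocks, elements_per_thread


# Hypothetical launch parameters, not taken from the benchmark.
n, threads, heuristic_blocks = 2_100_000, 256, 48

print(pad_blocks(n, threads, heuristic_blocks))    # (49, 170)
print(pad_elements(n, threads, heuristic_blocks))  # (48, 171)
```

Both configurations cover every element; the second one just does so without exceeding the block count the heuristic asked for.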
Notice the 171 elements per thread instead of 170, for the first iteration.
Curiously, removing the per-thread iteration altogether speeds up everything even further:
I'm assuming that the 48/49 slowdown was due to unbalanced SM load, but it's weird that a totally unbounded block count performs better than the 48 blocks suggested by the occupancy API... Since we might be introducing more grid-stride kernels, I'm inclined to leave the per-thread iteration in for now.
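As a rough picture of the load-imbalance guess, here is a toy model. It assumes (hypothetically) that the device can keep exactly 48 blocks resident at once and that blocks are scheduled in whole "waves"; neither assumption was measured, and the function name is made up for illustration.

```python
def cld(a: int, b: int) -> int:
    """Ceiling division for positive integers."""
    return -(-a // b)


def waves(blocks: int, resident_blocks: int) -> int:
    """Number of scheduling rounds needed if only `resident_blocks` can run concurrently."""
    return cld(blocks, resident_blocks)


resident_blocks = 48  # hypothetical: the block count suggested by the occupancy heuristic

print(waves(48, resident_blocks))  # 1
print(waves(49, resident_blocks))  # 2
```

Under this model a single extra block forces a second, nearly empty round. It does not explain why an unbounded block count is faster still, so it should be read as a sketch of the hypothesis rather than a confirmed cause.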