-
Notifications
You must be signed in to change notification settings - Fork 756
add tests for scan_by_key in-place execution #1696
add tests for scan_by_key in-place execution #1696
Conversation
Summarizing some offline discussions and testing:
@senior-zero will be investigating this a bit more before we make a decision here. |
I've managed to fuse some kernels to reduce slowdown up to 6% on small to moderate problem sizes. The performance regression disappears at about |
6% for small inputs and no impact on larger inputs seems worthwhile to restore the documented guarantees. IMO we should update the implementation. |
b5126d6
to
aa55ec2
Compare
@allisonvacanti I've bumped CUB version to one that incorporates updated scan by key. I'd like to re purpose this PR to update tests in thrust. Please, review the changes. |
run tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's also add tests to the thrust/testing/scan_by_key.*.cu
files to make sure the non-CUDA backends test this, too.
aa55ec2
to
380870d
Compare
run tests |
// in-place scans: keys/values aliasing | ||
thrust::inclusive_scan_by_key(h_keys.begin(), h_keys.end(), h_vals.begin(), h_output.begin()); | ||
inclusive_scan_by_key_kernel<<<1,1>>>(exec, d_keys.begin(), d_keys.end(), d_vals.begin(), d_keys.begin()); | ||
{ | ||
cudaError_t const err = cudaDeviceSynchronize(); | ||
ASSERT_EQUAL(cudaSuccess, err); | ||
} | ||
ASSERT_EQUAL(d_keys, h_output); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick, no change requested: I would prefer if we put the whole test in its own block rather than only the check, as that helps readability a lot
// in-place scans: keys/values aliasing | |
thrust::inclusive_scan_by_key(h_keys.begin(), h_keys.end(), h_vals.begin(), h_output.begin()); | |
inclusive_scan_by_key_kernel<<<1,1>>>(exec, d_keys.begin(), d_keys.end(), d_vals.begin(), d_keys.begin()); | |
{ | |
cudaError_t const err = cudaDeviceSynchronize(); | |
ASSERT_EQUAL(cudaSuccess, err); | |
} | |
ASSERT_EQUAL(d_keys, h_output); | |
{ // in-place scans: keys/values aliasing | |
thrust::inclusive_scan_by_key(h_keys.begin(), h_keys.end(), h_vals.begin(), h_output.begin()); | |
inclusive_scan_by_key_kernel<<<1,1>>>(exec, d_keys.begin(), d_keys.end(), d_vals.begin(), d_keys.begin()); | |
cudaError_t const err = cudaDeviceSynchronize(); | |
ASSERT_EQUAL(cudaSuccess, err); | |
ASSERT_EQUAL(d_keys, h_output); | |
} |
Our scan by key implementation doesn't allow keys/result aliasing. This is caused by non-synchronized reading of
d_keys_in[tile_base - 1]
in the following code. Therefore, aliasing of keys and results introduces a data race. The following code can be used as a reproducer:We also don't have tests for this use case. I can fix the code to guarantee this behaviour (just like in adjacent difference), but it'll slow down non-in-place execution in some cases. My suggestion is to have a quick fix of documentation.