Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Add code change workaround for 64-bit reduce_by_segment bug #1791

Closed

Conversation

mmichel11
Copy link
Contributor

@mmichel11 mmichel11 commented Aug 22, 2024

There is an IGC bug that affects reduce_by_segment with 64-bit types on GPU Series Max devices which has previously required us to provide the ONEDPL_WORKAROUND_FOR_IGPU_64BIT_REDUCTION macro workaround. This workaround invokes the legacy implementation which is around ~3x slower but produces correct results.

The IGC bug still exists, but I have a found a workaround with negligible performance impact within our reduce_by_segment implementation. This enables users to invoke the faster reduce_by_segment implementation without correctness issues.

By first initializing the private memory arrays to the known identity element prior to loading real data into some of the array indices, the register filling bug is avoided. I have verified with oneDPL tests (which previously caught this issue) and with external tests.

I have also removed the macro workaround and additional test.

I've collected information on the performance impact which is negligible. Feel free to request if you would like to see it.

Filling the SYCL private memory array with the identity prior to loading data works around the encountered IGC bug.
No real performance impact can be measured with this change. The current macro workaround is also removed

Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
@mmichel11 mmichel11 added the bug label Aug 22, 2024
// TODO: Remove this initialization to the identity when possible. We load real data to __loc_partials
// in the first loop below but this initialization to the identity works around an IGC register
// filling bug.
std::array<__val_type, __vals_per_item> __loc_partials = {__identity};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this meant to fill the array with the identity value? Because I believe as it is currently written, only the first value in the array would be populated and the rest will be uninitialized. If the intent is for all of the elements to be the identity, then this can be written as:

Suggested change
std::array<__val_type, __vals_per_item> __loc_partials = {__identity};
std::array<__val_type, __vals_per_item> __loc_partials;
std::fill(__loc_partials.begin(), __loc_partials.end(), __identity);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, it looks like the rest of the elements may be initialized to 0: https://en.cppreference.com/w/c/language/array_initialization.

The way the fix was implemented still worked since it does not matter what is loaded into the array as long as it's something. However, I switched to your suggestion to be consistent.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filling the array after its definition seems to reintroduce the bug. I will see if I can find a better solution. I suppose what we originally had adds a default constructability requirement we do not want.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have reverted the change I made here. My last statement is wrong, the default constructor for each array element is already called when we declare the array, so we are not adding any additional requirements.

I have explored some different ways to try to workaround the issue, but this seems to be the only thing that works. I've confirmed that register filling bug is avoided as our tests pass along with internal reproducers where the issue was reported.

Signed-off-by: Matthew Michel <matthew.michel@intel.com>
@@ -351,7 +351,12 @@ __sycl_reduce_by_segment(__internal::__hetero_tag<_BackendTag>, _ExecutionPolicy
__seg_reduce_wg_kernel,
#endif
sycl::nd_range<1>{__n_groups * __wgroup_size, __wgroup_size}, [=](sycl::nd_item<1> __item) {
::std::array<__val_type, __vals_per_item> __loc_partials;
auto __identity = unseq_backend::__known_identity<_BinaryOperator, __val_type>;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use __val_type instead of auto - it's will more readable, I think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

…irst"

I decided to keep this solution as it does not add any additional requirements as I first thought. It looks strange,
as the first element of the array is set to the identity while the others are default constructed. However, this
results in IGC avoiding its own bug internally.

This reverts commit 3e08d9d.
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
@mmichel11
Copy link
Contributor Author

Thank you for the reviews so far in this PR. I am closing it in favor of #1915. The entire reduce_by_segment SYCL-based implementation is replaced with a reduce-then-scan call and is unaffected by this device compilation issue while also bringing significant performance improvement.

If for any reason we still need this fix, I can re-open in the future.

@mmichel11 mmichel11 closed this Oct 23, 2024
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants