[SYCL] Improve performance of generic shuffles #3815


Merged: 2 commits, Jul 21, 2021

Conversation

Pennycook
Contributor

The previous upper bound of the loop in GenericCall considered only the offset,
allowing a memcpy for the final chunk to walk off the end of the byte array.

sycl::detail::memcpy is implemented as a loop, resulting in different
optimizations than std::memcpy.

These two changes improve the performance of generic shuffles by approximately
2x for common use-cases, such as a struct of 3 floats.
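The loop-bound issue described above can be illustrated with a small host-side sketch. It is not the code from this PR: `ChunkPlan`, `CopyInChunks`, and `Float3` are hypothetical names, and the 8/4/2/1-byte chunk sizes are an assumption made for illustration. The point is that the loop condition checks `Offset + Size`, not just `Offset`, so a chunk is emitted only if it fits entirely inside the object.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical helper: lists the (offset, size) chunks used to cover NumBytes.
inline std::vector<std::pair<std::size_t, std::size_t>>
ChunkPlan(std::size_t NumBytes) {
  // Assumed chunk sizes, largest first; the tail is covered by smaller chunks.
  constexpr std::size_t Sizes[] = {8, 4, 2, 1};
  std::vector<std::pair<std::size_t, std::size_t>> Plan;
  std::size_t Offset = 0;
  for (std::size_t Size : Sizes) {
    // A bound of `Offset < NumBytes` would emit an 8-byte chunk at offset 8
    // of a 12-byte object and read past its end. Checking Offset + Size
    // keeps every chunk inside the object.
    for (; Offset + Size <= NumBytes; Offset += Size)
      Plan.emplace_back(Offset, Size);
  }
  return Plan;
}

// Copies an object chunk by chunk through a 64-bit staging variable.
template <typename T> T CopyInChunks(const T &X) {
  static_assert(std::is_trivially_copyable<T>::value,
                "byte-wise copy requires a trivially copyable type");
  T Result;
  const char *XBytes = reinterpret_cast<const char *>(&X);
  char *ResultBytes = reinterpret_cast<char *>(&Result);
  for (const auto &Chunk : ChunkPlan(sizeof(T))) {
    std::uint64_t Staging = 0;
    std::memcpy(&Staging, XBytes + Chunk.first, Chunk.second);
    // A real shuffle would exchange Staging between work-items here.
    std::memcpy(ResultBytes + Chunk.first, &Staging, Chunk.second);
  }
  return Result;
}

// The "struct of 3 floats" case mentioned above (12 bytes on common ABIs).
struct Float3 {
  float X, Y, Z;
};
```

For `Float3`, this plan is one 8-byte chunk at offset 0 followed by one 4-byte chunk at offset 8, so no access extends past byte 12.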

Signed-off-by: John Pennycook <john.pennycook@intel.com>

Pennycook added 2 commits May 25, 2021 13:23
Previous upper bound considered only the offset, allowing a memcpy
for the final chunk to walk off the end of the byte array.

Signed-off-by: John Pennycook <john.pennycook@intel.com>
sycl::detail::memcpy is implemented as a loop, resulting in different
optimizations than std::memcpy.

Signed-off-by: John Pennycook <john.pennycook@intel.com>
@Pennycook Pennycook added the enhancement (New feature or request) and performance (Performance related issues) labels May 25, 2021
@Pennycook Pennycook marked this pull request as ready for review July 8, 2021 13:24
@Pennycook Pennycook requested a review from AlexeySachkov as a code owner July 8, 2021 13:24
@Pennycook
Contributor Author

Opening this for review now that #3879 has been merged.

@Pennycook
Contributor Author

@AlexeySachkov ping. Also, is there anybody else you think should review this?

@bader bader merged commit fb08adf into intel:sycl Jul 21, 2021
@@ -153,9 +153,9 @@ EnableIfGenericBroadcast<T, IdT> GroupBroadcast(T x, IdT local_id) {
   char *ResultBytes = reinterpret_cast<char *>(&Result);
   auto BroadcastBytes = [=](size_t Offset, size_t Size) {
     uint64_t BroadcastX, BroadcastResult;
-    detail::memcpy(&BroadcastX, XBytes + Offset, Size);
+    std::memcpy(&BroadcastX, XBytes + Offset, Size);
Contributor


It looks like this breaks the tests due to a missing #include <cstring>. See the pre-commit failures for #4153.
@Pennycook, could you fix this ASAP, please?

Contributor Author


Sorry about that. Opened #4157 to (hopefully) fix it.

bader pushed a commit that referenced this pull request Jul 21, 2021
The changes in #3815 introduced calls to std::memcpy without including
<cstring>, causing some tests to fail.

Signed-off-by: John Pennycook <john.pennycook@intel.com>
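
As a hedged illustration of the two copy styles involved, the sketch below contrasts a hand-written byte loop with `std::memcpy`, which is declared in `<cstring>`. `LoopCopy`, `BitsViaLoop`, and `BitsViaMemcpy` are hypothetical names; `LoopCopy` only stands in for a loop-based copy and is not the actual `sycl::detail::memcpy` implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring> // declares std::memcpy; omitting it is the failure noted above

// Hypothetical stand-in for a copy implemented as an explicit byte loop.
inline void LoopCopy(void *Dst, const void *Src, std::size_t N) {
  auto *D = static_cast<unsigned char *>(Dst);
  const auto *S = static_cast<const unsigned char *>(Src);
  for (std::size_t I = 0; I < N; ++I)
    D[I] = S[I];
}

static_assert(sizeof(float) == sizeof(std::uint32_t),
              "sketch assumes a 32-bit float");

// Reads the bit pattern of a float using the explicit loop.
inline std::uint32_t BitsViaLoop(float F) {
  std::uint32_t Bits = 0;
  LoopCopy(&Bits, &F, sizeof(Bits));
  return Bits;
}

// Reads the same bit pattern using std::memcpy, which compilers commonly
// recognize and lower to a single move for small constant sizes.
inline std::uint32_t BitsViaMemcpy(float F) {
  std::uint32_t Bits = 0;
  std::memcpy(&Bits, &F, sizeof(Bits));
  return Bits;
}
```

Both functions return the same value; the difference is only in how the compiler is able to optimize the copy.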
alexbatashev added a commit to alexbatashev/llvm that referenced this pull request Jul 22, 2021
…ackend_plugin

* upstream/sycl: (26 commits)
  [SPIR-V][NFC] Move non-upstreamed FuncParam decorations into internal:: (intel#4138)
  [SYCL] Move free function queries to experimental namespace (intel#4090)
  [SYCL][XPTI] Enable PI calls notifications with arguments (intel#4148)
  [SYCL] Revert queue::wait() to its old behaviour with Level Zero (intel#4153)
  [SYCL] Add missing <cstring> header to spirv.hpp (intel#4157)
  [SYCL] Adds info query for atomic_memory_order_capabilities on device and context (intel#4105)
  [SYCL] Improve performance of generic shuffles (intel#3815)
  [SYCL] Fix the error with namespaces caused during rebase of intel#4014 (intel#4151)
  [ESIMD] Fix 'ambiguous operator' error with length 1 simd operands (intel#4149)
  [libdevice][NFC] Fix libdevice dependencies list (intel#4130)
  [SPIR-V] Reland Encode debug info producer in SPIR-V (intel#4082)
  [SYCL][ROCm] Add ROCm support to get_device_count_by_type (intel#4113)
  [SYCL] Fix sRGB device info (intel#4145)
  [SYCL][ROCm] Fix kernel launch with multiple dimensions (intel#4063)
  [SYCL][ROCm] Fix compilation for AMD GPU with -fsycl-dead-args-optimization (intel#4126)
  [SYCL][Level Zero] Enable multi-CCS support. (intel#4038)
  [SYCL] Pass bound arch to unbundler (intel#4112)
  [ESIMD][doc] Added documentation for some ESIMD math APIs (intel#3995)
  [ESIMD] rename gather4/scatter4 to gather_rgba/scatter_rgba (intel#4120)
  [SYCL][NFC] Remove unused variable. (intel#4131)
  ...