-
Notifications
You must be signed in to change notification settings - Fork 769
[SYCL] Improve performance of generic shuffles #3815
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
Conversation
Previous upper bound considered only the offset, allowing a memcpy for the final chunk to walk off the end of the byte array. Signed-off-by: John Pennycook <john.pennycook@intel.com>
sycl::detail::memcpy is implemented as a loop, resulting in different optimizations than std::memcpy. Signed-off-by: John Pennycook <john.pennycook@intel.com>
Opening this for review now that #3879 has been merged. |
@AlexeySachkov ping. Also, is there anybody else you think should review this? |
@@ -153,9 +153,9 @@ EnableIfGenericBroadcast<T, IdT> GroupBroadcast(T x, IdT local_id) { | |||
char *ResultBytes = reinterpret_cast<char *>(&Result); | |||
auto BroadcastBytes = [=](size_t Offset, size_t Size) { | |||
uint64_t BroadcastX, BroadcastResult; | |||
detail::memcpy(&BroadcastX, XBytes + Offset, Size); | |||
std::memcpy(&BroadcastX, XBytes + Offset, Size); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It look like this breaks the tests due to missing #include <cstring>
. See pre-commit failures for #4153.
@Pennycook, could you fix this ASAP, please?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry about that. Opened #4157 to (hopefully) fix it.
The changes in #3815 introduced calls to std::memcpy without including <cstring>, causing some tests to fail. Signed-off-by: John Pennycook <john.pennycook@intel.com>
…ackend_plugin * upstream/sycl: (26 commits) [SPIR-V][NFC] Move non-upstreamed FuncParam decorations into internal:: (intel#4138) [SYCL] Move free function queries to experimental namespace (intel#4090) [SYCL][XPTI] Enable PI calls notifications with arguments (intel#4148) [SYCL] Revert queue::wait() to its old behaviour with Level Zero (intel#4153) [SYCL] Add missing <cstring> header to spirv.hpp (intel#4157) [SYCL] Adds info query for atomic_memory_order_capabilities on device and context (intel#4105) [SYCL] Improve performance of generic shuffles (intel#3815) [SYCL] Fix the error with namespaces caused during rebase of intel#4014 (intel#4151) [ESIMD] Fix 'ambiguous operator' error with length 1 simd operands (intel#4149) [libdevice][NFC] Fix libdevice dependencies list (intel#4130) [SPIR-V] Reland Encode debug info producer in SPIR-V (intel#4082) [SYCL][ROCm] Add ROCm support to get_device_count_by_type (intel#4113) [SYCL] Fix sRGB device info (intel#4145) [SYCL][ROCm] Fix kernel launch with multiple dimensions (intel#4063) [SYCL][ROCm] Fix compilation for AMD GPU with -fsycl-dead-args-optimization (intel#4126) [SYCL][Level Zero] Enable multi-CCS support. (intel#4038) [SYCL] Pass bound arch to unbundler (intel#4112) [ESIMD][doc] Added documentation for some ESIMD math APIs (intel#3995) [ESIMD] rename gather4/scatter4 to gather_rgba/scatter_rgba (intel#4120) [SYCL][NFC] Remove unused variable. (intel#4131) ...
The previous upper bound of the loop in GenericCall considered only the offset,
allowing a memcpy for the final chunk to walk off the end of the byte array.
sycl::detail::memcpy is implemented as a loop, resulting in different
optimizations than std::memcpy.
These two changes improve the performance of generic shuffles by approximately
2x for common use-cases, such as a struct of 3 floats.
Signed-off-by: John Pennycook john.pennycook@intel.com