[SYCL][L0][CUDA][HIP] Fix PI_KERNEL_GROUP_INFO_GLOBAL_WORK_SIZE queries #8769
Conversation
I have a question: is the max global work size independent of the global work size set in a host program for a kernel?
/verify with intel/llvm-test-suite#1694
@abagusetty, FYI: the "verify with" command does not validate on CUDA/HIP platforms.
Thanks, I stumbled upon that too and looked at the wording in the spec, which made me think it could be the max global limits.
The global work sizes from the query will be the same for any kernel, right?
Yes, since the descriptor is a kernel_device_specific one: any kernel (on a custom device type, or a built-in kernel) returns the device-specific global work sizes, which in turn should be the same for all kernels IMO.
…m device-types appropriately
sycl/plugins/cuda/pi_cuda.hpp
Outdated
@@ -42,6 +42,11 @@
 #include <unordered_map>
 #include <vector>

+// Helper for one-liner validation
+#define PI_ASSERT(condition, error) \
It's a bit misleading, as it does not assert on the condition, maybe consider renaming it?
PI_ASSERT to PI_ERR_CHECK
Gentle ping @smaslov-intel @jchlanda
+1 on L0 changes.
Sorry for the delay. I think these changes look good. I am a little curious what built-in kernels they would apply to, but I assume CUDA, HIP and L0 guarantee full possible work-sizes either way.
Thanks for the feedback on the built-ins; I stumbled over that a bit too, but convinced myself that they see the complete device limits.
intel#8769 Signed-off-by: Jaime Arteaga <jaime.a.arteaga.molina@intel.com>
Address the global_work_size kernel query (PI_KERNEL_GROUP_INFO_GLOBAL_WORK_SIZE) for L0, CUDA, and HIP.
Fixes #8766
For instance (for the X-dimension):
L0: maxGroupSizeX * maxGroupCountX
CUDA: CU_DEVICE_ATTRIBUTE_MAX_BLOCK_DIM_X * CU_DEVICE_ATTRIBUTE_MAX_GRID_DIM_X