Replies: 1 comment
-
It is also stated that the programs are “launch[ed] in an order that promotes data reuse”. The execution order of thread blocks is undefined; the closest hardware mechanism on NVIDIA GPUs is co-scheduling via thread block clusters, and a cluster portably allows at most 8 thread blocks, which appears to match the GROUP_SIZE_M: 8 setting. Does a group of programs correspond to a thread block cluster? That is, are GROUP_SIZE_M programs mapped onto one thread block cluster?
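For context, and assuming the kernel in question is the one from Triton's matmul tutorial, GROUP_SIZE_M enters as a compile-time meta-parameter through the autotuner; a sketch in the spirit of the tutorial's configs (the specific values here are illustrative):

```python
import triton

# GROUP_SIZE_M is supplied as a compile-time meta-parameter
# alongside the block sizes.
configs = [
    triton.Config(
        {'BLOCK_SIZE_M': 128, 'BLOCK_SIZE_N': 256, 'BLOCK_SIZE_K': 64,
         'GROUP_SIZE_M': 8},
        num_stages=3, num_warps=8,
    ),
]
```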
-
My understanding is that consecutive program IDs are mapped to 2D program IDs such that each group covers GROUP_SIZE_M block-rows of C, and consecutive programs advance down the M dimension within a group before moving to the next block-column.
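If I read the tutorial's kernel correctly, the mapping looks roughly like the following (a plain-Python sketch of the grouped-ordering arithmetic; variable names follow the tutorial):

```python
def grouped_2d_pid(pid, num_pid_m, num_pid_n, GROUP_SIZE_M):
    """Map a linear program id to a (pid_m, pid_n) block coordinate."""
    # One group spans GROUP_SIZE_M block-rows of C across all block-columns.
    num_pid_in_group = GROUP_SIZE_M * num_pid_n
    group_id = pid // num_pid_in_group
    first_pid_m = group_id * GROUP_SIZE_M
    # The last group may contain fewer than GROUP_SIZE_M block-rows.
    group_size_m = min(num_pid_m - first_pid_m, GROUP_SIZE_M)
    # Consecutive pids advance down M first, so GROUP_SIZE_M consecutive
    # programs share one block-column of B while touching GROUP_SIZE_M
    # block-rows of A.
    pid_m = first_pid_m + (pid % num_pid_in_group) % group_size_m
    pid_n = (pid % num_pid_in_group) // group_size_m
    return pid_m, pid_n
```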
More efficient use of the L2 cache would result in fewer misses and higher TFLOPS.
Is the 10% claim based on experiments in which the automatically managed shared memory was controlled for, so that the improvement is specifically attributable to the L2 cache?
How do you design an informative baseline for comparing kernels when the shared memory is automatically managed?
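One way to make the comparison concrete without touching shared-memory behaviour at all is a pure counting model: fix a wave of concurrently resident programs and count how many distinct blocks of A and B that wave touches under each ordering. Fewer distinct blocks means more potential L2 reuse, all else equal. A toy sketch (reusing grouped_2d_pid from above; the 9x9 grid and wave size are illustrative, not measured):

```python
def blocks_touched(order, wave):
    """Distinct A block-rows + B block-columns loaded by the first `wave` programs."""
    wave_pids = order[:wave]
    a_blocks = {pid_m for pid_m, _ in wave_pids}
    b_blocks = {pid_n for _, pid_n in wave_pids}
    return len(a_blocks) + len(b_blocks)

num_pid_m = num_pid_n = 9
row_major = [(p // num_pid_n, p % num_pid_n)
             for p in range(num_pid_m * num_pid_n)]
grouped = [grouped_2d_pid(p, num_pid_m, num_pid_n, GROUP_SIZE_M=3)
           for p in range(num_pid_m * num_pid_n)]

print(blocks_touched(row_major, wave=9))  # 1 + 9 = 10 distinct blocks
print(blocks_touched(grouped, wave=9))    # 3 + 3 = 6 distinct blocks
```

As an experimental baseline, note that setting GROUP_SIZE_M=1 in the mapping above recovers row-major ordering while leaving block sizes, num_stages, and hence the shared-memory footprint unchanged, which isolates the launch-order effect.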