Replies: 1 comment
-
It is also stated that the programs are “launch[ed] in an order that promotes data reuse”. The execution order of thread blocks is undefined; the closest hardware mechanism on NVIDIA GPUs is co-scheduling via thread block clusters, and a cluster portably allows at most 8 thread blocks, which appears to match the GROUP_SIZE_M: 8 setting. Does a group of programs correspond to a thread block cluster? That is, are GROUP_SIZE_M programs mapped onto one thread block cluster?
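For context, and assuming the kernel in question is the one from Triton's matmul tutorial, GROUP_SIZE_M enters as a compile-time meta-parameter through the autotuner; a sketch in the spirit of the tutorial's configs (the specific values here are illustrative):

```python
import triton

# GROUP_SIZE_M is supplied as a compile-time meta-parameter
# alongside the block sizes.
configs = [
    triton.Config(
        {'BLOCK_SIZE_M': 128, 'BLOCK_SIZE_N': 256, 'BLOCK_SIZE_K': 64,
         'GROUP_SIZE_M': 8},
        num_stages=3, num_warps=8,
    ),
]
```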
-
My understanding is that consecutive program IDs are mapped to 2D program IDs such that each group covers GROUP_SIZE_M block-rows of C, and consecutive programs advance down the M dimension within a group before moving to the next block-column.
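If I read the tutorial's kernel correctly, the mapping looks roughly like the following (a plain-Python sketch of the grouped-ordering arithmetic; variable names follow the tutorial):

```python
def grouped_2d_pid(pid, num_pid_m, num_pid_n, GROUP_SIZE_M):
    """Map a linear program id to a (pid_m, pid_n) block coordinate."""
    # One group spans GROUP_SIZE_M block-rows of C across all block-columns.
    num_pid_in_group = GROUP_SIZE_M * num_pid_n
    group_id = pid // num_pid_in_group
    first_pid_m = group_id * GROUP_SIZE_M
    # The last group may contain fewer than GROUP_SIZE_M block-rows.
    group_size_m = min(num_pid_m - first_pid_m, GROUP_SIZE_M)
    # Consecutive pids advance down M first, so GROUP_SIZE_M consecutive
    # programs share one block-column of B while touching GROUP_SIZE_M
    # block-rows of A.
    pid_m = first_pid_m + (pid % num_pid_in_group) % group_size_m
    pid_n = (pid % num_pid_in_group) // group_size_m
    return pid_m, pid_n
```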
More efficient use of the L2 cache would result in fewer misses and higher TFLOPS.
Is the 10% claim based on experiments in which the automatically managed shared memory was controlled for, so that the improvement is specifically attributable to the L2 cache?
How do you design an informative baseline for comparing kernels when the shared memory is automatically managed?
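One way to make the comparison concrete without touching shared-memory behaviour at all is a pure counting model: fix a wave of concurrently resident programs and count how many distinct blocks of A and B that wave touches under each ordering. Fewer distinct blocks means more potential L2 reuse, all else equal. A toy sketch (reusing grouped_2d_pid from above; the 9x9 grid and wave size are illustrative, not measured):

```python
def blocks_touched(order, wave):
    """Distinct A block-rows + B block-columns loaded by the first `wave` programs."""
    wave_pids = order[:wave]
    a_blocks = {pid_m for pid_m, _ in wave_pids}
    b_blocks = {pid_n for _, pid_n in wave_pids}
    return len(a_blocks) + len(b_blocks)

num_pid_m = num_pid_n = 9
row_major = [(p // num_pid_n, p % num_pid_n)
             for p in range(num_pid_m * num_pid_n)]
grouped = [grouped_2d_pid(p, num_pid_m, num_pid_n, GROUP_SIZE_M=3)
           for p in range(num_pid_m * num_pid_n)]

print(blocks_touched(row_major, wave=9))  # 1 + 9 = 10 distinct blocks
print(blocks_touched(grouped, wave=9))    # 3 + 3 = 6 distinct blocks
```

As an experimental baseline, note that setting GROUP_SIZE_M=1 in the mapping above recovers row-major ordering while leaving block sizes, num_stages, and hence the shared-memory footprint unchanged, which isolates the launch-order effect.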