
Allow separate sub-DAG for load and compute warp groups with warp-specialized circular buffering. #3941

Open
rdspring1 opened this issue Feb 21, 2025 · 0 comments

Why? To support persistent and ping-pong matmul kernels.

Context:
The circular-buffer for-loop is the first serial for-loop to the left of the computeAt position, and circular buffering is applied to the load cacheAfter TensorViews. Persistent scheduling for matmul creates multiple serial for-loops: a grid-stride for-loop over output tiles and a cta-k for-loop.

Scheduling Proposal:

  • Merge the output-tile and cta-k iterDomains.
  • Apply circular buffering to the load cacheAfter TensorViews.
  • The wgmma consumer would keep separate output-tile and cta-k iterDomains.

Why? The load cacheAfter TensorViews will then have a single serial iterDomain for circular buffering, which matches the current circular buffering implementation.
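The effect of the merge can be sketched with plain index arithmetic (a minimal Python illustration, not nvFuser code; the loop extents and stage count are made-up values):

```python
# Toy extents for illustration only.
NUM_TILES = 4   # grid-stride output tiles
CTA_K = 3       # cta-k iterations per output tile
STAGES = 2      # circular-buffer depth

# After merging output-tile and cta-k, the load TensorView has one serial
# loop, so each iteration maps to a circular-buffer stage by a simple modulo.
merged_stages = [i % STAGES for i in range(NUM_TILES * CTA_K)]

# The compute side's nested (output-tile, cta-k) loops recover the same
# stage sequence, because the merged index is tile * CTA_K + k.
nested_stages = [(tile * CTA_K + k) % STAGES
                 for tile in range(NUM_TILES)
                 for k in range(CTA_K)]

assert merged_stages == nested_stages
print(merged_stages[:6])  # → [0, 1, 0, 1, 0, 1]
```

This is also why the restriction below (the compute for-loops must be derivable from the load for-loop) is enough for the two warp groups to agree on stage indices.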

Problem: The output-tile and cta-k iterDomains cannot be merged for the compute warp groups, because the epilogue storing the matmul results to global memory does not have a cta-k iterDomain. Therefore, the wgmma consumer cannot be inlined with the cacheAfter TMA load, breaking the current circular buffering implementation.

Does the consumer of circular-buffered inputs need to be inlined?
This restriction seems unnecessary for warp-specialized circular buffering.

Lowering Proposal: For warp-specialized circular buffering, track separate for-loops for the load and compute warp groups.
Restriction: The compute for-loops must be derivable from the load for-loop.

Pseudo-code:

mbarrier init
if (tma-load) {
  decrease_register_limit(40);
  for (output-tile) {
    for (cta-k) {
      if (elect-sync) {
        mbarrier wait for empty stage
        mbarrier arriveExpectTx for tma load
        tma load operand A and B cta tiles for stage
      }
    }
  }
} else { // compute warp-group
  increase_register_limit(232);
  mbarrier arrive to signal all stages are empty
  for (output-tile) {
    for (cta-k) {
      mbarrier wait for full stage
      for (warp-k) {
        wgmma_fence;
        wgmma_64m_256m_16k;
      }
      wgmma_commit;
      wgmma_wait;
      mbarrier arrive to signal current stage is empty
    }
    wgmma_wait;
    convert fp32 results to bf16
    stmatrix from registers to shared memory
    block_sync();
    tma store from shared to global memory
    tma_store_commit;
    tma_store_wait;
  }
}
destroy mbarrier
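The mbarrier handshake above can be simulated on the host to check that the two warp groups stay in lock-step even though they run different loop nests: a single merged serial loop on the load side, and nested output-tile/cta-k loops on the compute side. This is a Python sketch using semaphores in place of mbarriers; all sizes, and the use of addition as a stand-in for wgmma, are illustrative assumptions:

```python
import threading

STAGES = 2
NUM_TILES = 2
CTA_K = 3

empty = [threading.Semaphore(1) for _ in range(STAGES)]  # "stage is empty" mbarriers
full = [threading.Semaphore(0) for _ in range(STAGES)]   # "stage is full" mbarriers
buffers = [None] * STAGES                                # circular-buffer stages
results = []

def load_warp_group():
    # Single merged serial loop over (output-tile, cta-k).
    for i in range(NUM_TILES * CTA_K):
        s = i % STAGES
        empty[s].acquire()   # mbarrier wait for empty stage
        buffers[s] = i       # stand-in for the TMA load of operand tiles
        full[s].release()    # stand-in for arriveExpectTx / stage full

def compute_warp_group():
    # Separate nested loops: output-tile outer, cta-k inner.
    for tile in range(NUM_TILES):
        acc = 0
        for k in range(CTA_K):
            s = (tile * CTA_K + k) % STAGES  # derived from the load loop index
            full[s].acquire()    # mbarrier wait for full stage
            acc += buffers[s]    # stand-in for the wgmma main loop
            empty[s].release()   # signal current stage is empty
        results.append(acc)      # stand-in for the epilogue store per tile

t0 = threading.Thread(target=load_warp_group)
t1 = threading.Thread(target=compute_warp_group)
t0.start(); t1.start(); t0.join(); t1.join()
assert results == [0 + 1 + 2, 3 + 4 + 5]
```

The assertion passes only if every compute iteration reads the stage the load side most recently filled, which is the property the derivation restriction is meant to guarantee.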