What content is inside workspace_buffer and how should I choose a proper buffer size? #260

Answered by yzh119
Tomorrowdawn asked this question in Q&A

Sorry for the late reply.

The workspace buffer holds two kinds of buffers:

  1. Integer buffers (metadata): after we tile the attention computation along the query-length or KV-length dimension, we need some metadata to index the page table after partitioning.
  2. Float buffers (intermediate attention outputs): we use the split-k trick to increase parallelism for small problem sizes. In this case, we need buffers in global memory to store the intermediate attention results (the partial attention output and the logsumexp factor, see https://docs.flashinfer.ai/tutorials/recursive_attention.html) for each KV chunk, and we need another round of reduction to get the complete atte…
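The metadata in item 1 can be illustrated with a small planning sketch. This is not FlashInfer's scheduler: `plan_kv_chunks`, `ChunkInfo`, `merge_indptr` and the CSR-style `kv_page_indptr` layout are hypothetical names chosen for illustration. It assumes each request's pages sit as a contiguous run in one flattened page-index array and that chunk boundaries are aligned to page boundaries.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class ChunkInfo:
    request: int     # index of the request in the batch
    kv_start: int    # first KV token of this chunk (inclusive)
    kv_end: int      # last KV token of this chunk (exclusive)
    page_start: int  # first slot in the flattened page-index array
    page_end: int    # one past the last slot in the flattened page-index array


def ceil_div(a: int, b: int) -> int:
    return (a + b - 1) // b


def plan_kv_chunks(
    kv_lens: Sequence[int],
    kv_page_indptr: Sequence[int],
    page_size: int,
    chunk_size: int,
) -> Tuple[List[ChunkInfo], List[int]]:
    """Hypothetical planner: split each request's KV sequence into chunks.

    Returns per-chunk metadata plus merge_indptr, where the partial results of
    request r occupy slots merge_indptr[r] .. merge_indptr[r + 1] - 1.
    """
    if chunk_size % page_size != 0:
        raise ValueError("chunk_size must be a multiple of page_size")
    chunks: List[ChunkInfo] = []
    merge_indptr = [0]
    for r, kv_len in enumerate(kv_lens):
        base = kv_page_indptr[r]  # where request r's pages start in the page table
        for kv_start in range(0, kv_len, chunk_size):
            kv_end = min(kv_start + chunk_size, kv_len)
            chunks.append(ChunkInfo(
                request=r,
                kv_start=kv_start,
                kv_end=kv_end,
                page_start=base + kv_start // page_size,
                page_end=base + ceil_div(kv_end, page_size),
            ))
        merge_indptr.append(len(chunks))  # end of request r's partial results
    return chunks, merge_indptr


if __name__ == "__main__":
    chunks, merge_indptr = plan_kv_chunks([5, 12], [0, 2, 5], page_size=4, chunk_size=8)
    for c in chunks:
        print(c)
    print(merge_indptr)
```

For `kv_lens` of `[5, 12]` with `page_size=4` and `chunk_size=8`, the second request is split into two chunks, so `merge_indptr` comes out as `[0, 1, 3]`. Longer sequences or larger batches produce more chunks and therefore more metadata, which suggests why the required workspace depends on the workload.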
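The reduction in item 2 is the merge of partial attention states described in the linked recursive-attention tutorial. The sketch below is a plain-Python illustration for a single query vector, not FlashInfer's kernel: `partial_attention` and `merge_states` are hypothetical helpers, the logsumexp uses the natural log, and the usual 1/sqrt(d) score scaling is omitted for brevity. Each KV chunk yields an (output, lse) pair, which is what the float buffer has to hold per chunk, and merging the pairs reproduces attention over the whole sequence.

```python
import math
from typing import List, Sequence, Tuple

State = Tuple[List[float], float]  # (partial output vector, logsumexp of its scores)


def partial_attention(q: Sequence[float], keys, values) -> State:
    """Softmax attention of one query over one KV chunk (no 1/sqrt(d) scaling)."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / denom for d in range(dim)]
    return out, m + math.log(denom)  # lse = log(sum(exp(scores)))


def merge_states(states: Sequence[State]) -> State:
    """Reduction step: combine per-chunk (output, lse) pairs into the full result."""
    lse_max = max(lse for _, lse in states)
    scales = [math.exp(lse - lse_max) for _, lse in states]  # relative chunk weights
    total = sum(scales)
    dim = len(states[0][0])
    out = [sum(s * o[d] for s, (o, _) in zip(scales, states)) / total for d in range(dim)]
    return out, lse_max + math.log(total)


if __name__ == "__main__":
    q = [1.0, 0.5]
    keys = [[0.1, 0.2], [0.4, -0.3], [1.0, 0.0], [-0.5, 0.8]]
    values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
    full = partial_attention(q, keys, values)
    # split-k: two KV chunks processed independently, then reduced
    parts = [partial_attention(q, keys[:2], values[:2]),
             partial_attention(q, keys[2:], values[2:])]
    print(full)
    print(merge_states(parts))
```

Running it prints the same output vector and logsumexp (up to floating-point rounding) for the full computation and for the two-chunk merge. Because the merge is associative, the chunks can be computed in parallel and reduced afterwards, which is where the extra parallelism for small problem sizes comes from.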

Answer selected by Tomorrowdawn