
perf: accelerate gqa performance #356

Merged
merged 2 commits into main from accelerate-gqa on Jul 4, 2024

Conversation

@yzh119 yzh119 (Collaborator) commented Jul 3, 2024

Changes:

  1. Prefetch page indices (we had already applied this optimization to the decode kernels, but not to the append/prefill kernels used for GQA).
  2. Enable the 1x4 warp layout from perf: use 1x4 warp layout for small query length #322. We did not enable it previously because it made the binary size too large; we should further remove some unnecessary template arguments to shrink it.
  3. Optimize threadblock_sync_mdo_states for efficiently merging the attention states of multiple warps in a threadblock. Our previous implementation assumed a small shared memory size and interleaved shared memory reads/writes with computation, which is less efficient than bulk shared memory accesses.
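The state merge in step 3 follows the standard online-softmax merging rule: each warp produces a partial output, a running max, and a softmax denominator over its KV chunk, and these are combined into the exact full-attention result. Below is a minimal NumPy sketch of that math for a single query (the function name and state layout are illustrative, not the kernel's actual implementation):

```python
import numpy as np

def merge_attention_states(states):
    """Merge per-chunk partial attention states.

    Each state is a tuple (o_i, m_i, d_i):
      o_i -- partial output, already normalized by d_i
      m_i -- running max of the attention scores in that chunk
      d_i -- softmax denominator sum(exp(s_j - m_i)) over that chunk
    Returns the merged (o, m, d), identical to computing softmax
    attention over all chunks at once.
    """
    m = max(m_i for _, m_i, _ in states)                      # global max
    # rescale each denominator to the global max, then sum
    d = sum(d_i * np.exp(m_i - m) for _, m_i, d_i in states)
    # rescale each partial output by its corrected weight, then renormalize
    o = sum(o_i * (d_i * np.exp(m_i - m)) for o_i, m_i, d_i in states) / d
    return o, m, d
```

The bulk-access optimization described above amounts to writing all per-warp (o_i, m_i, d_i) states to shared memory first, synchronizing once, and then performing this reduction, rather than interleaving shared memory traffic with the computation.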

After this PR, the GQA kernel execution time (on an H100) for the setting batch_size=128, seq_len=1024, num_qo_heads=32, num_kv_heads=4, head_dim=128 improves from 133us to 103us.
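For reference, the group size implied by this configuration (what makes it a GQA workload) and the resulting speedup are simple arithmetic on the numbers above:

```python
# Numbers taken from the benchmark setting above.
num_qo_heads, num_kv_heads = 32, 4
group_size = num_qo_heads // num_kv_heads  # 8 query heads share each KV head
before_us, after_us = 133, 103
speedup = before_us / after_us  # roughly a 1.29x speedup
```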

@yzh119 yzh119 merged commit e56ddad into main Jul 4, 2024
@yzh119 yzh119 deleted the accelerate-gqa branch July 5, 2024 22:57
yzh119 added a commit that referenced this pull request Jul 12, 2024
🤖 I have created a release *beep* *boop*
---


## [0.0.9](v0.0.8...v0.0.9) (2024-07-12)

### Bug Fixes

* fix the decode kernel segfault in cudagraph mode
([#368](https://github.com/flashinfer-ai/flashinfer/pull/368)) ([c69cfa](https://github.com/flashinfer-ai/flashinfer/commit/c69cfabc540e4a7edd991713df10d575ff3b0c21))
* fix decode kernels output for empty kv cache
([#363](https://github.com/flashinfer-ai/flashinfer/pull/363)) ([ac72b1](https://github.com/flashinfer-ai/flashinfer/commit/ac72b1cc14a6474d601f371c8d69e2600ac28d2f))
* check gpu id in PyTorch APIs and use input tensor's gpu default stream
([#361](https://github.com/flashinfer-ai/flashinfer/pull/361)) ([1b84fa](https://github.com/flashinfer-ai/flashinfer/commit/1b84fab3e4f53fb4fa26952fdb46fa8018634057))

### Performance Improvements

* accelerate alibi
([#365](#365))
([4f0a9f9](4f0a9f9))
* accelerate gqa performance
([#356](#356))
([e56ddad](e56ddad))
* Optimize tensor conversions in C++ code to avoid unnecessary copies
([#366](#366))
([1116237](1116237))

### Acknowledgement

We thank [@Yard1](https://github.com/Yard1),
[@Ying1123](https://github.com/Ying1123) and
[@zhyncs](https://github.com/zhyncs) for their contributions.

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Zihao Ye <expye@outlook.com>