feat: support cuda graph for batched multi-query(prefill/append) attention #275

yzh119 · 2024-06-02T06:20:50Z

Follow-up to #187 and #256.

@ibsidorenko

🤖 I have created a release *beep* *boop*

---

## [0.1.0](v0.0.4...v0.1.0) (2024-06-20)

### Highlights

* Support any GQA group size for tensor-cores kernels.
* Support any page size for tensor-cores kernels.
* Support CUDA-Graph for prefill/decode APIs.
* Add an option to accelerate decode kernels with Tensor Cores.
* Support custom attention mask. (https://docs.flashinfer.ai/tutorials/kv_layout.html#mask-layout-2d-ragged-tensor)
* Support logits cap in Grok-1 models.
* Fused GPU-sampling kernels: top-p, top-k, speculative verification. (https://docs.flashinfer.ai/api/python/sampling.html)
* PyTorch wrapper of group-gemm cutlass kernels. (https://docs.flashinfer.ai/api/python/sampling.html)

### Acknowledgement

We thank [@ibsidorenko](https://github.com/ibsidorenko), [@LiuXiaoxuanPKU](https://github.com/LiuXiaoxuanPKU), [@Yard1](https://github.com/Yard1), [@AgrawalAmey](https://github.com/AgrawalAmey), [@xuzhenqi](https://github.com/xuzhenqi), [@mgerstgrasser](https://github.com/mgerstgrasser), [@esmeetu](https://github.com/esmeetu), [@yz-tang](https://github.com/yz-tang), [@HSQ79815](https://github.com/HSQ79815), [@Qubitium](https://github.com/Qubitium), [@shreygupta2809](https://github.com/shreygupta2809), [@sighingnow](https://github.com/sighingnow), [@vinx13](https://github.com/vinx13), [@tqchen](https://github.com/tqchen), [@merrymercy](https://github.com/merrymercy), [@comaniac](https://github.com/comaniac) and many others for their contributions and helpful discussions for 0.0.5 release.

### Refactor

* support any GQA group size for tensor-cores kernels ([#301](#301)) ([c111ca](c111ca6))
* support any page size for tensor-cores kernels ([#306](#306)) ([82fd8c](82fd8c7))

### Features

* add `use_tensor_cores` option to decode kernels to accelerate GQA ([#317](#317)) ([3b50dd5](3b50dd5))
* add group gemm operators ([#282](#282)) ([e08ba42](e08ba42))
* initial support of distributed operators ([#289](#289)) ([03553da](03553da))
* initial support of logits hook ([#298](#298)) ([ab1e2ad](ab1e2ad))
* Separate Q and KV dtypes for decode ([#286](#286)) ([5602659](5602659))
* support cuda graph for batched multi-query(prefill/append) attention ([#275](#275)) ([83ceb67](83ceb67))
* support cuda graph for batched multi-query(prefill/append) attention ([#277](#277)) ([24cc583](24cc583))
* support custom attention mask in prefill/append attention kernels ([#266](#266)) ([7304282](7304282))
* fused speculative sampling kernels ([#259](#259)) ([cea2bb](cea2bb9))
* expose sampling APIs in pytorch ([#238](#238)) ([092902](0929023))

### Performance Improvements

* initial cuda graph support ([#256](#256)) ([7e9cc7f](7e9cc7f))
* split kv-cache for prefill/append kernels ([#310](#310)) ([f0bb0a3](f0bb0a3))
* use packed bit array for attention mask ([#308](#308)) ([3d43dc9](3d43dc9))

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Zihao Ye <expye@outlook.com>

yzh119 added 3 commits June 2, 2024 06:22

upd

3ceb85a

upd

a07e86f

rebase

9081522

yzh119 force-pushed the prefill-cuda-graph branch from 1a09125 to 9081522 Compare June 2, 2024 06:24

yzh119 added 3 commits June 2, 2024 06:29

typo

fe25f06

another typo

4f7dd54

upd

5d31a4b

yzh119 merged commit 83ceb67 into main Jun 2, 2024

github-actions bot mentioned this pull request Jun 2, 2024

chore(main): release 0.0.5 #232

Merged

yzh119 mentioned this pull request Jun 2, 2024

Revert "feat: support cuda graph for batched multi-query(prefill/append) attention" #276

Merged

yzh119 added a commit that referenced this pull request Jun 2, 2024

Revert "feat: support cuda graph for batched multi-query(prefill/appe…

081a4c5

…nd) attention" (#276) Reverts #275

yzh119 mentioned this pull request Jun 2, 2024

feat: support cuda graph for batched multi-query(prefill/append) attention #277

Merged

yzh119 added a commit that referenced this pull request Jun 2, 2024

feat: support cuda graph for batched multi-query(prefill/append) atte…

24cc583

…ntion (#277) #275 is not complete; this PR pushes the remaining changes.

yzh119 deleted the prefill-cuda-graph branch June 7, 2024 18:21

github-actions bot mentioned this pull request Jul 31, 2024

chore(main): release 0.1.4 #415

Merged
