v0.0.5
github-actions released this 20 Jun 08:42
0.0.5 (2024-06-20)
Highlights
- Support any GQA group size for tensor-cores kernels.
- Support any page size for tensor-cores kernels.
- Support CUDA-Graph for prefill/decode APIs.
- Add an option to accelerate decode kernels with Tensor Cores.
- Support custom attention mask. (https://docs.flashinfer.ai/tutorials/kv_layout.html#mask-layout-2d-ragged-tensor)
- Support logits cap in Grok-1 models.
- Fused GPU-sampling kernels: top-p, top-k, speculative verification. (https://docs.flashinfer.ai/api/python/sampling.html)
- PyTorch wrapper of group-gemm cutlass kernels. (https://docs.flashinfer.ai/api/python/group_gemm.html)
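For readers unfamiliar with top-p (nucleus) sampling, the following is a minimal pure-Python reference of what the operation computes. It is only a sketch of the semantics: the function name `top_p_sample` and its arguments are hypothetical, it is not the FlashInfer API, and it does not reflect how the fused GPU kernels are implemented. See the sampling documentation linked above for the actual interface.

```python
def top_p_sample(probs, top_p, u):
    """Reference top-p sampling (illustrative only, not the FlashInfer API).

    probs: list of token probabilities summing to 1
    top_p: cumulative-probability threshold in (0, 1]
    u:     a uniform random number in [0, 1)
    """
    # Sort token ids by probability, highest first.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Keep the shortest prefix whose cumulative mass reaches top_p.
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Inverse-CDF sampling over the kept set, scaled by its total mass.
    target = u * mass
    acc = 0.0
    for i in kept:
        acc += probs[i]
        if target < acc:
            return i
    return kept[-1]  # guard against floating-point round-off
```

With `probs = [0.1, 0.6, 0.3]` and `top_p = 0.8`, only tokens 1 and 2 are kept, so token 0 is never returned regardless of `u`.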
Acknowledgement
We thank @ibsidorenko, @LiuXiaoxuanPKU, @Yard1, @AgrawalAmey, @xuzhenqi, @mgerstgrasser, @esmeetu, @yz-tang, @HSQ79815, @Qubitium, @shreygupta2809, @sighingnow, @vinx13, @tqchen, @merrymercy, @comaniac and many others for their contributions and helpful discussions for the 0.0.5 release.
Refactor
- support any GQA group size for tensor-cores kernels (#301) (c111ca)
- support any page size for tensor-cores kernels (#306) (82fd8c)
Features
- add `use_tensor_cores` option to decode kernels to accelerate GQA (#317) (3b50dd5)
- add group gemm operators (#282) (e08ba42)
- initial support of distributed operators (#289) (03553da)
- initial support of logits hook (#298) (ab1e2ad)
- Separate Q and KV dtypes for decode (#286) (5602659)
- support cuda graph for batched multi-query(prefill/append) attention (#275) (83ceb67)
- support cuda graph for batched multi-query(prefill/append) attention (#277) (24cc583)
- support custom attention mask in prefill/append attention kernels (#266) (7304282)
- fused speculative sampling kernels (#259) (cea2bb)
- expose sampling APIs in pytorch (#238) (092902)
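As a rough illustration of the custom attention mask feature, the sketch below flattens per-request 2D masks into a single 1D ragged array with an offset array, which is our reading of the "2D ragged tensor" mask layout named in the documentation link under Highlights. The helper names `flatten_masks` and `causal_mask`, and the variable names `mask_data` and `qk_indptr`, are used here for illustration only. The exact argument names and dtypes expected by the prefill/append kernels should be taken from the FlashInfer documentation, not from this sketch.

```python
def causal_mask(qo_len, kv_len):
    """Build a (qo_len x kv_len) boolean causal mask (hypothetical helper).

    Queries are assumed to be aligned to the end of the kv sequence, so
    query i may attend to kv positions j <= i + kv_len - qo_len.
    """
    return [[j <= i + kv_len - qo_len for j in range(kv_len)]
            for i in range(qo_len)]


def flatten_masks(masks):
    """Concatenate per-request 2D masks into one 1D list plus offsets.

    Each mask is flattened row by row (query dimension first, kv dimension
    last). qk_indptr[r] is the start offset of request r in mask_data.
    """
    mask_data = []
    qk_indptr = [0]
    for mask in masks:
        for row in mask:
            mask_data.extend(row)
        qk_indptr.append(len(mask_data))
    return mask_data, qk_indptr


# Two requests with different (qo_len, kv_len); no padding is needed.
mask_data, qk_indptr = flatten_masks([causal_mask(2, 3), causal_mask(1, 2)])
```

Here `qk_indptr` is `[0, 6, 8]`: the first request contributes 2 × 3 mask entries and the second contributes 1 × 2.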