21 Jun 18:47

v0.0.6

0.0.6 (2024-06-21)

Performance Improvements

~~use 1x4 warp layout for small query length~~ (not activated because of large binary size) (#322) (4e89b4d)


20 Jun 08:42

v0.0.5

0.0.5 (2024-06-20)

Highlights

Support any GQA group size for tensor-cores kernels.
Support any page size for tensor-cores kernels.
Support CUDA-Graph for prefill/decode APIs.
Add an option to accelerate decode kernels with Tensor Cores.
Support custom attention mask. (https://docs.flashinfer.ai/tutorials/kv_layout.html#mask-layout-2d-ragged-tensor)
Support logits cap in Grok-1 models.
Fused GPU-sampling kernels: top-p, top-k, speculative verification. (https://docs.flashinfer.ai/api/python/sampling.html)
PyTorch wrapper of group-gemm cutlass kernels. (https://docs.flashinfer.ai/api/python/group_gemm.html)
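
As an illustration of the logits cap highlight, the sketch below applies a tanh-based soft cap to attention logits before softmax. It is a minimal pure-Python example rather than the kernel code: the function names `soft_cap` and `softmax` and the cap value of 30.0 are assumptions made for illustration.

```python
import math


def soft_cap(logits, cap=30.0):
    """Squash each logit into the open interval (-cap, cap) with a scaled tanh.

    The cap value is an illustrative assumption, not taken from the release.
    """
    return [cap * math.tanh(x / cap) for x in logits]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


raw = [0.5, 12.0, 250.0, -90.0]
capped = soft_cap(raw)   # small logits pass almost unchanged, large ones saturate
probs = softmax(capped)  # attention weights computed from the capped logits
```

Small logits are left nearly untouched because tanh(x) is close to x near zero, while outliers are bounded by the cap, which keeps a single very large score from dominating the softmax.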

Acknowledgement

We thank @ibsidorenko, @LiuXiaoxuanPKU, @Yard1, @AgrawalAmey, @xuzhenqi, @mgerstgrasser, @esmeetu, @yz-tang, @HSQ79815, @Qubitium, @shreygupta2809, @sighingnow, @vinx13, @tqchen, @merrymercy, @comaniac and many others for their contributions and helpful discussions for the 0.0.5 release.

Refactor

support any GQA group size for tensor-cores kernels (#301) (c111ca)
support any page size for tensor-cores kernels (#306) (82fd8c)

Features

add use_tensor_cores option to decode kernels to accelerate GQA (#317) (3b50dd5)
add group gemm operators (#282) (e08ba42)
initial support of distributed operators (#289) (03553da)
initial support of logits hook (#298) (ab1e2ad)
Separate Q and KV dtypes for decode (#286) (5602659)
support cuda graph for batched multi-query(prefill/append) attention (#275) (83ceb67)
support cuda graph for batched multi-query(prefill/append) attention (#277) (24cc583)
support custom attention mask in prefill/append attention kernels (#266) (7304282)
fused speculative sampling kernels (#259) (cea2bb)
expose sampling APIs in pytorch (#238) (092902)
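
For readers unfamiliar with the sampling operators listed above, here is a reference-style sketch of top-p (nucleus) sampling in plain Python. It is a sorting-based textbook version meant only to show the semantics; the fused GPU kernels may well use a different algorithm, and the function name `top_p_sample` and its signature are hypothetical.

```python
import random


def top_p_sample(probs, top_p, rng):
    """Sample an index from the smallest set of tokens whose mass reaches top_p.

    Hypothetical helper for illustration; not the library's API.
    """
    # Visit token indices from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # Sample from the kept tokens in proportion to their original probability.
    total = sum(probs[i] for i in kept)
    r = rng.random() * total
    acc = 0.0
    for i in kept:
        acc += probs[i]
        if r < acc:
            return i
    return kept[-1]  # guard against floating-point round-off


rng = random.Random(0)
probs = [0.5, 0.3, 0.15, 0.05]
draws = [top_p_sample(probs, 0.7, rng) for _ in range(200)]
```

With `top_p=0.7` only the two most probable tokens survive the cut, so every draw is index 0 or 1.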

Performance Improvements

initial cuda graph support (#256) (7e9cc7f)
split kv-cache for prefill/append kernels (#310) (f0bb0a3)
use packed bit array for attention mask (#308) (3d43dc9)
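
The packed bit array item can be pictured with the following sketch, which stores eight boolean mask entries per byte instead of one value per entry. The helper names `pack_mask` and `read_mask` are hypothetical, and the little-endian bit order within each byte is an assumption for this example.

```python
def pack_mask(bits):
    """Pack a flat list of booleans into bytes, eight mask entries per byte.

    Bit i of the mask lands in byte i // 8 at bit position i % 8
    (little-endian bit order, assumed for this sketch).
    """
    packed = bytearray((len(bits) + 7) // 8)
    for i, bit in enumerate(bits):
        if bit:
            packed[i // 8] |= 1 << (i % 8)
    return bytes(packed)


def read_mask(packed, i):
    """Return the boolean stored at flat position i of a packed mask."""
    return bool((packed[i // 8] >> (i % 8)) & 1)


mask = [True, False, True, True, False, False, False, False, True, False]
packed = pack_mask(mask)
restored = [read_mask(packed, i) for i in range(len(mask))]
```

Ten mask entries fit in two bytes here, which is the kind of memory-traffic saving such a layout is aimed at.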


02 May 07:52

v0.0.4

0.0.4 (2024-05-01)

Features

pytorch 2.3 support
more gqa group sizes
add mma instructions for fp8 (#179) (d305798)
mma rowsum for fp8 (#180) (5af935c)
support any num_heads for get_alibi_slope (#200) (b217a6f)
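
To show what supporting any num_heads involves for ALiBi slopes, the sketch below follows the scheme from the published ALiBi reference code: a geometric sequence for the largest power of two not exceeding the head count, plus interleaved slopes for the remaining heads. Whether `get_alibi_slope` uses exactly this scheme is an assumption, and `alibi_slopes` is a hypothetical name.

```python
import math


def alibi_slopes(num_heads):
    """Per-head ALiBi slopes for an arbitrary number of heads.

    Illustrative sketch only; not the library's implementation.
    """
    # Largest power of two that does not exceed num_heads.
    n = 2 ** math.floor(math.log2(num_heads))
    base = 2.0 ** (-8.0 / n)
    slopes = [base ** i for i in range(1, n + 1)]
    if n < num_heads:
        # Remaining heads take every other slope of the 2n-head sequence.
        extra_base = 2.0 ** (-4.0 / n)
        slopes += [extra_base ** i for i in range(1, 2 * (num_heads - n) + 1, 2)]
    return slopes


slopes_8 = alibi_slopes(8)    # power-of-two head count
slopes_12 = alibi_slopes(12)  # non-power-of-two head count
```

For 8 heads this gives 1/2, 1/4, ..., 1/256; for 12 heads the first eight slopes are the same and four more are appended.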

Bug Fixes

fix python package dispatch error message (#182) (8eed01c)


08 Mar 10:06

v0.0.3

0.0.3 (2024-03-08)

Features

adding sm_scale field for all attention APIs (#145) (85d4018)
enable head_dim=256 for attention kernels (#132) (0372acc)
pytorch api of fp8 kv-cache (#156) (66ee066)
support ALiBi (#146) (383518b)

Misc

add stream argument in BeginForwardFunction of TVMWrapper (#164) (fabfcb5)

Bug Fixes

bugfix to pr 135 (#136) (3d55c71)
fix bugs introduced in #132 (#135) (9b7b0b9)
fix FindThrust.cmake (#161) (30fa584)

Performance Improvements

multiply q by sm_scale in decode kernels (#144) (660c559)
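
The idea behind scaling q by sm_scale can be sketched as follows: because the dot product is linear, scaling the query once gives the same scores as scaling every query-key product, at the cost of one multiply per query element instead of one per key. The function names are hypothetical and this is plain Python, not the kernel code.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def scores_scale_each(q, keys, sm_scale):
    """Scale every q.k product individually: one multiply per key."""
    return [sm_scale * dot(q, k) for k in keys]


def scores_prescaled(q, keys, sm_scale):
    """Scale q once up front: one multiply per element of q."""
    q_scaled = [sm_scale * x for x in q]
    return [dot(q_scaled, k) for k in keys]


q = [1.0, -2.0, 0.5, 3.0]
keys = [[0.2, 0.1, -0.4, 1.0], [1.5, 0.0, 2.0, -1.0], [0.3, 0.3, 0.3, 0.3]]
sm_scale = 1.0 / math.sqrt(len(q))  # the usual 1/sqrt(head_dim) default

a = scores_scale_each(q, keys, sm_scale)
b = scores_prescaled(q, keys, sm_scale)
```

Both lists hold the same attention scores up to floating-point rounding.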


16 Feb 11:38

yzh119

Release v0.0.2

Changelog

Support RoPE position info in batch prefill/decode kernels #69 (C++ API only)
Use Torch's current stream for ops #111
Add pre-built wheels for different pytorch versions. #110
Add pre-built wheels for py39 #114


31 Jan 19:03

yzh119

Release v0.0.1
