Release v0.0.5 · flashinfer-ai/flashinfer

0.0.5 (2024-06-20)

Support any GQA group size for Tensor Core kernels.
Support any page size for Tensor Core kernels.
Support CUDA-Graph for prefill/decode APIs.
Add an option to accelerate decode kernels with Tensor Cores.
Support custom attention mask. (https://docs.flashinfer.ai/tutorials/kv_layout.html#mask-layout-2d-ragged-tensor)
Support logits cap in Grok-1 models.
Fused GPU-sampling kernels: top-p, top-k, speculative verification. (https://docs.flashinfer.ai/api/python/sampling.html)
PyTorch wrapper of group-GEMM CUTLASS kernels. (https://docs.flashinfer.ai/api/python/group_gemm.html)
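
For readers unfamiliar with the logits cap mentioned above, the idea can be sketched in a few lines of plain Python. This is only an illustration, not FlashInfer's kernel code: the function name `soft_cap`, the tanh form, and the cap value of 30.0 are assumptions made for this example.

```python
import math

def soft_cap(logits, cap):
    """Squash each logit into the open interval (-cap, cap) with a scaled tanh.

    Hypothetical helper for illustration; not part of the FlashInfer API.
    """
    if cap <= 0:
        raise ValueError("cap must be positive")
    return [cap * math.tanh(x / cap) for x in logits]

# Small logits pass through almost unchanged; large ones saturate near +/- cap.
scores = [-100.0, -1.0, 0.0, 1.0, 100.0]
capped = soft_cap(scores, cap=30.0)  # cap value assumed for this example
```

Because tanh is monotonic, capping preserves the ordering of the attention scores while bounding their magnitude.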

add use_tensor_cores option to decode kernels to accelerate GQA (#317) (3b50dd5)
add group gemm operators (#282) (e08ba42)
initial support of distributed operators (#289) (03553da)
initial support of logits hook (#298) (ab1e2ad)
Separate Q and KV dtypes for decode (#286) (5602659)
support cuda graph for batched multi-query(prefill/append) attention (#275) (83ceb67)
support cuda graph for batched multi-query(prefill/append) attention (#277) (24cc583)
support custom attention mask in prefill/append attention kernels (#266) (7304282)
fused speculative sampling kernels (#259) (cea2bb)
expose sampling APIs in pytorch (#238) (092902)
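
As a reference point for what the top-p sampling operators compute, here is a minimal CPU sketch in plain Python. It is a straightforward sort-based version written for clarity, not the fused GPU kernel, and the name `top_p_sample` and its parameters are hypothetical rather than taken from the FlashInfer API.

```python
import random

def top_p_sample(probs, top_p, rng=random):
    """Sample an index from the smallest set of tokens whose mass reaches top_p.

    Hypothetical reference helper; `probs` is assumed to sum to 1.
    """
    if not 0.0 < top_p <= 1.0:
        raise ValueError("top_p must be in (0, 1]")
    # Sort token ids by probability, largest first.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Draw from the kept tokens, implicitly renormalised by `mass`.
    r = rng.random() * mass
    acc = 0.0
    for i in kept:
        acc += probs[i]
        if r < acc:
            return i
    return kept[-1]  # guard against floating-point round-off

# Usage: with top_p=0.7 only the two most likely tokens can be drawn.
token = top_p_sample([0.5, 0.3, 0.15, 0.05], top_p=0.7, rng=random.Random(0))
```

Passing a seeded `random.Random` instance keeps the draw reproducible, which is convenient when comparing against another implementation.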