v0.0.9
0.0.9 (2024-07-12)
Bugfix
- fix decode kernels output for empty KV cache (#363) (ac72b1)
- check GPU id in PyTorch APIs and use the default stream of the input tensor's GPU (#361) (1b84fa)
Performance Improvements
- accelerate ALiBi (#365) (4f0a9f9)
- accelerate GQA performance (#356) (e56ddad)
- optimize tensor conversions in C++ code to avoid unnecessary copies (#366) (1116237)
Acknowledgement
We thank @Yard1, @Ying1123 and @zhyncs for their contributions.