KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache #5492
Comments
Noteworthy is the fact that llama.cpp only supports an 8-bit K cache; an 8-bit V cache is not implemented yet.
Not true: Q4_0 and Q4_1 K cache quantization work for me and are documented in this PR:
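For reference, here is a minimal sketch of requesting a quantized K cache through the llama.cpp C API. It assumes the `type_k`/`type_v` fields of `llama_context_params` and a placeholder model path; exact field names and defaults can differ between builds, so treat it as illustrative rather than authoritative.

```cpp
// Sketch: request a Q4_0-quantized K cache while keeping the V cache at FP16.
// Backend/NUMA initialization is omitted for brevity.
#include "llama.h"

int main() {
    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_load_model_from_file("model.gguf", mparams); // placeholder path

    llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx  = 8192;
    cparams.type_k = GGML_TYPE_Q4_0; // quantized K cache (Q4_1 or Q8_0 also possible)
    cparams.type_v = GGML_TYPE_F16;  // V cache left at FP16

    llama_context * ctx = llama_new_context_with_model(model, cparams);

    // ... run inference as usual ...

    llama_free(ctx);
    llama_free_model(model);
    return 0;
}
```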
This issue is stale because it has been open for 30 days with no activity.
Is anyone else still interested in this feature? It would be incredibly helpful for running long contexts on systems with limited VRAM.
@ikawrakow Is there anything you can help with to implement this in the project? We have made a lot of progress on weight quants, but we are still using an FP16 KV cache :)
I have been using q8_0 for the K part of the cache for a long time now without any issues.
To me it looks like the topic of quantized cache needs more attention from the project maintainers rather than quantization improvements:
This issue was closed because it has been inactive for 14 days since being marked as stale.
@ggerganov With FA merged, is there any chance to improve the speed of KV quants so they become useful?
Feature Description
KIVI is a tuning-free asymmetric quantization scheme that stores the KV cache in 2 bits. Per the paper, this brings 2.6× less peak memory on the evaluated Llama/Mistral/Falcon models while enabling a 4× larger batch size, resulting in a 2.35×–3.47× throughput improvement.
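For readers unfamiliar with the technique, below is a rough C++ sketch of the asymmetric (scale + zero-point) low-bit quantization idea that KIVI builds on. It is not the paper's actual kernel (KIVI quantizes the key cache per channel and the value cache per token, and keeps a small window of recent tokens in full precision); group size, layout, and names here are illustrative only.

```cpp
// Sketch of asymmetric 2-bit quantization of one group of floats
// (e.g. one K-cache channel or one V-cache token row).
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

struct Q2Group {
    float scale;            // (max - min) / 3, since 2 bits give 4 levels
    float zero;             // group minimum (the asymmetric zero point)
    std::vector<uint8_t> q; // 2-bit codes, one per byte for clarity (real kernels pack 4 per byte)
};

Q2Group quantize_2bit(const std::vector<float> & x) {
    const float mn = *std::min_element(x.begin(), x.end());
    const float mx = *std::max_element(x.begin(), x.end());
    float scale = (mx - mn) / 3.0f;
    if (scale == 0.0f) scale = 1.0f; // degenerate group: all values equal

    Q2Group g{scale, mn, {}};
    g.q.reserve(x.size());
    for (float v : x) {
        const int code = (int) std::lround((v - mn) / scale);
        g.q.push_back((uint8_t) std::clamp(code, 0, 3));
    }
    return g;
}

std::vector<float> dequantize_2bit(const Q2Group & g) {
    std::vector<float> out;
    out.reserve(g.q.size());
    for (uint8_t c : g.q) out.push_back(g.zero + g.scale * c);
    return out;
}

int main() {
    const std::vector<float> keys = {-1.2f, 0.3f, 0.9f, 2.1f};
    const Q2Group g = quantize_2bit(keys);
    for (float v : dequantize_2bit(g)) printf("%.3f ", v); // reconstructed values
    printf("\n");
    return 0;
}
```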
Motivation
Reduce the memory used by the KV cache during long-context batch inference.
https://arxiv.org/abs/2402.02750
https://github.com/jy-yuan/KIVI
It was posted on Reddit:
https://www.reddit.com/r/LocalLLaMA/comments/1ap3bkt/kv_cache_is_huge_and_bottlenecks_llm_inference_we/
Possible Implementation
https://github.com/jy-yuan/KIVI
I find it quite interesting; it could help VRAM-poor users a lot, even without large batches or long contexts.