
KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache #5492

Closed
sorasoras opened this issue Feb 14, 2024 · 10 comments

Labels
enhancement New feature or request stale

Comments

@sorasoras

Feature Description

KIVI quantizes the KV cache to 2 bits. According to the paper, this brings 2.6× less peak memory on the evaluated Llama/Mistral/Falcon models while enabling a 4× larger batch size, resulting in a 2.35× to 3.47× throughput improvement.

Motivation

Reduce the memory used by the KV cache during long-context batch inference.
https://arxiv.org/abs/2402.02750
https://github.com/jy-yuan/KIVI

It was also posted on Reddit:
https://www.reddit.com/r/LocalLLaMA/comments/1ap3bkt/kv_cache_is_huge_and_bottlenecks_llm_inference_we/

Possible Implementation

https://github.com/jy-yuan/KIVI

I find it quite interesting; it might help VRAM-poor users a lot even without large batches or long contexts.
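For context on what the paper proposes: the core operation is asymmetric (scale plus zero-point) group quantization at 2 bits, applied per-channel to the K cache and per-token to the V cache. Below is a minimal sketch of that quantization step, not the authors' implementation; the group size, struct layout, and function names are illustrative assumptions, and the paper's bit-packing and full-precision window of recent tokens are omitted.

```cpp
// Minimal sketch of asymmetric 2-bit group quantization in the spirit of KIVI:
// each group of values is mapped to codes {0,1,2,3} with a per-group scale and
// zero-point. KIVI groups the K cache per-channel and the V cache per-token;
// the flat group layout here is only an illustration.
#include <algorithm>
#include <cmath>
#include <cstdint>

constexpr int GROUP_SIZE = 32; // illustrative choice, not the paper's value

struct QuantGroup {
    float   scale;           // (max - min) / 3, since 2 bits give 4 levels
    float   zero;            // group minimum (the zero-point)
    uint8_t q[GROUP_SIZE];   // 2-bit codes, kept unpacked for clarity
};

QuantGroup quantize_group(const float * x) {
    const auto [mn, mx] = std::minmax_element(x, x + GROUP_SIZE);
    QuantGroup g;
    g.zero  = *mn;
    g.scale = (*mx > *mn) ? (*mx - *mn) / 3.0f : 1.0f;
    for (int i = 0; i < GROUP_SIZE; ++i) {
        const int q = (int) std::lround((x[i] - g.zero) / g.scale);
        g.q[i] = (uint8_t) std::clamp(q, 0, 3);
    }
    return g;
}

void dequantize_group(const QuantGroup & g, float * out) {
    for (int i = 0; i < GROUP_SIZE; ++i) {
        out[i] = g.q[i] * g.scale + g.zero; // asymmetric: code * scale + zero-point
    }
}
```

The paper additionally keeps the most recent tokens in full precision and packs the 2-bit codes four to a byte; both details are left out here for brevity.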

@sorasoras sorasoras added the enhancement New feature or request label Feb 14, 2024
@Green-Sky
Collaborator

Noteworthy is the fact that llama.cpp supports KV cache quantization. Going below q8_0 usually leads to very poor quality, however.

@Dampfinchen

> Noteworthy is the fact that llama.cpp supports KV cache quantization. Going below q8_0 usually leads to very poor quality, however.

Llama.cpp only supports an 8-bit K cache; an 8-bit V cache is not implemented yet.

@BarfingLemurs
Contributor

Not true; Q4_0 and Q4_1 K cache quantization works for me and is documented in this PR:

#4312
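For anyone who wants to try this: since the changes in #4312, the cache types can be selected programmatically through llama_context_params. A minimal sketch follows; the field and function names match the C API as of early 2024 and may differ in later llama.cpp versions.

```cpp
// Sketch: creating a llama.cpp context with a quantized K cache.
// type_k / type_v were added in PR #4312. The V cache is left at f16 here,
// since quantized V is not supported at this point in the discussion.
#include "llama.h"

llama_context * make_ctx(llama_model * model) {
    llama_context_params cparams = llama_context_default_params();
    cparams.type_k = GGML_TYPE_Q8_0; // quantize the K cache to q8_0
    cparams.type_v = GGML_TYPE_F16;  // keep the V cache in f16
    return llama_new_context_with_model(model, cparams);
}
```

The example programs expose the same choice on the command line via -ctk / --cache-type-k (and -ctv / --cache-type-v), if I recall the option names correctly.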

@github-actions
Contributor

This issue is stale because it has been open for 30 days with no activity.

@github-actions github-actions bot added the stale label Mar 18, 2024
@DesperateZero

Is anyone else still interested in this feature? It would be incredibly helpful for running long contexts on systems with limited VRAM.

@slaren slaren removed the stale label Mar 18, 2024
@sorasoras
Author

@ikawrakow Is there anything you could help with to implement this in the project? We have made lots of progress on weight quants, but we are still using an FP16 KV cache :)

@Green-Sky
Collaborator

Green-Sky commented Mar 19, 2024

I have been using q8_0 for the K part of the cache for a long time now without any issues.

llama_new_context_with_model: KV self size = 980.00 MiB, K (q8_0): 340.00 MiB, V (f16): 640.00 MiB
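Those numbers line up with the storage math: q8_0 stores each block of 32 values in 34 bytes (32 int8 codes plus an f16 scale), versus 64 bytes for f16, so a 640 MiB f16 K cache shrinks to 640 × 34/64 = 340 MiB.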

@ikawrakow
Contributor

ikawrakow commented Mar 20, 2024

@sorasoras

To me it looks like the quantized cache code needs more attention from the project maintainers, rather than further quantization improvements:

  • Yes, we can have K quantized with Q4_0, Q4_1, Q5_0, Q5_1, or Q8_0, but not V (attempts to use a quantized V cache lead to an assert in ggml_cuda_cpy_tensor_2d).
  • Using a quantized K cache leads to a significant drop in inference speed (from 130 t/s to 76 t/s on my RTX 4080). From a quick look, the implementation seems far from optimal.
  • Using a quantized K cache other than Q8_0 results in a significant PPL increase. I personally have a hard time believing that a KV cache quantized with 2 bits, as stipulated by this issue and the quoted paper, will result in meaningful generation quality.
  • Using more sophisticated quantization techniques, which require significantly more CPU/GPU cycles, will be even more disastrous for performance (at least within the current quantized cache implementation). I did a quick test with IQ4_NL (it seems the block size needs to be 32, so IQ4_NL is the only non-legacy quantization type that can be used; see the sketch after this list). I see performance dropping even further, to 62 t/s. PPL improves compared to Q4_0, but not compared to Q4_1, so the only thing we gained is a ~17% reduction in the size of the K cache.
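To illustrate the block-size point from the last item above: one plausible reading of the constraint is that the quantization block size has to divide the per-head dimension (typically 128 for Llama-style models), which rules out the 256-element super-block k-quants and leaves only the 32-element-block types. The helper below is a hypothetical sketch, not a ggml function.

```cpp
// Hypothetical sketch of the block-size constraint on quantized K cache types:
// the block size must divide the per-head dimension. Legacy types and IQ4_NL
// use 32-element blocks; k-quants use 256-element super-blocks, which do not
// fit a typical head_dim of 128.
#include <cstdio>

struct CacheType { const char * name; int block_size; };

static bool fits_k_cache(const CacheType & t, int head_dim) {
    return head_dim % t.block_size == 0;
}

int main() {
    const int head_dim = 128; // typical for Llama/Mistral
    const CacheType types[] = {
        { "Q4_0", 32 }, { "Q4_1", 32 }, { "Q8_0", 32 }, { "IQ4_NL", 32 }, { "Q4_K", 256 },
    };
    for (const auto & t : types) {
        std::printf("%-6s block=%3d usable=%s\n", t.name, t.block_size,
                    fits_k_cache(t, head_dim) ? "yes" : "no");
    }
    return 0;
}
```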

@github-actions github-actions bot added the stale label Apr 20, 2024
Contributor

github-actions bot commented May 5, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

@github-actions github-actions bot closed this as completed May 5, 2024
@sorasoras
Author

@ggerganov With FA merged, is there any chance to improve the speed of KV quants so they become useful?
