
KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache #5492

Closed
sorasoras opened this issue Feb 14, 2024 · 10 comments

Labels
enhancement New feature or request stale

Comments

@sorasoras

Feature Description

KIVI quantizes the KV cache to 2 bits. According to the paper, this brings 2.6× less peak memory on the evaluated Llama/Mistral/Falcon models while enabling a 4× larger batch size, resulting in a 2.35× to 3.47× throughput improvement.

Motivation

Reduce the memory used by the KV cache during long-context batch inference.
https://arxiv.org/abs/2402.02750
https://github.com/jy-yuan/KIVI

It was also posted on Reddit:
https://www.reddit.com/r/LocalLLaMA/comments/1ap3bkt/kv_cache_is_huge_and_bottlenecks_llm_inference_we/

Possible Implementation

https://github.com/jy-yuan/KIVI

I find it quite interesting; it might help VRAM-poor users a lot even without large batches or long contexts.
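For context on what the paper proposes: the core operation is asymmetric (scale plus zero-point) group quantization at 2 bits, applied per-channel to the K cache and per-token to the V cache. Below is a minimal sketch of that quantization step, not the authors' implementation; the group size, struct layout, and function names are illustrative assumptions, and the paper's bit-packing and full-precision window of recent tokens are omitted.

```cpp
// Minimal sketch of asymmetric 2-bit group quantization in the spirit of KIVI:
// each group of values is mapped to codes {0,1,2,3} with a per-group scale and
// zero-point. KIVI groups the K cache per-channel and the V cache per-token;
// the flat group layout here is only an illustration.
#include <algorithm>
#include <cmath>
#include <cstdint>

constexpr int GROUP_SIZE = 32; // illustrative choice, not the paper's value

struct QuantGroup {
    float   scale;           // (max - min) / 3, since 2 bits give 4 levels
    float   zero;            // group minimum (the zero-point)
    uint8_t q[GROUP_SIZE];   // 2-bit codes, kept unpacked for clarity
};

QuantGroup quantize_group(const float * x) {
    const auto [mn, mx] = std::minmax_element(x, x + GROUP_SIZE);
    QuantGroup g;
    g.zero  = *mn;
    g.scale = (*mx > *mn) ? (*mx - *mn) / 3.0f : 1.0f;
    for (int i = 0; i < GROUP_SIZE; ++i) {
        const int q = (int) std::lround((x[i] - g.zero) / g.scale);
        g.q[i] = (uint8_t) std::clamp(q, 0, 3);
    }
    return g;
}

void dequantize_group(const QuantGroup & g, float * out) {
    for (int i = 0; i < GROUP_SIZE; ++i) {
        out[i] = g.q[i] * g.scale + g.zero; // asymmetric: code * scale + zero-point
    }
}
```

The paper additionally keeps the most recent tokens in full precision and packs the 2-bit codes four to a byte; both details are left out here for brevity.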

@sorasoras sorasoras added the enhancement New feature or request label Feb 14, 2024
@Green-Sky
Collaborator

Noteworthy is the fact that llama.cpp supports KV cache quantization. Going below q8_0 usually leads to very poor quality, however.

@Dampfinchen

> Noteworthy is the fact that llama.cpp supports KV cache quantization. Going below q8_0 usually leads to very poor quality, however.

Llama.cpp only supports an 8-bit K cache; an 8-bit V cache is not implemented yet.

@BarfingLemurs
Contributor

Not true; Q4_0 and Q4_1 K cache quantization works for me and is documented in this PR:

#4312
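For anyone who wants to try this: since the changes in #4312, the cache types can be selected programmatically through llama_context_params. A minimal sketch follows; the field and function names match the C API as of early 2024 and may differ in later llama.cpp versions.

```cpp
// Sketch: creating a llama.cpp context with a quantized K cache.
// type_k / type_v were added in PR #4312. The V cache is left at f16 here,
// since quantized V is not supported at this point in the discussion.
#include "llama.h"

llama_context * make_ctx(llama_model * model) {
    llama_context_params cparams = llama_context_default_params();
    cparams.type_k = GGML_TYPE_Q8_0; // quantize the K cache to q8_0
    cparams.type_v = GGML_TYPE_F16;  // keep the V cache in f16
    return llama_new_context_with_model(model, cparams);
}
```

The example programs expose the same choice on the command line via -ctk / --cache-type-k (and -ctv / --cache-type-v), if I recall the option names correctly.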

@github-actions
Contributor

This issue is stale because it has been open for 30 days with no activity.

@github-actions github-actions bot added the stale label Mar 18, 2024
@DesperateZero

Is anyone else still interested in this feature? It would be incredibly helpful for running long contexts on systems with limited VRAM.

@slaren slaren removed the stale label Mar 18, 2024
@sorasoras
Author

@ikawrakow Is there anything you could help with to implement this in the project? We have made lots of progress on weight quants, but we are still using an FP16 KV cache :)

@Green-Sky
Collaborator

Green-Sky commented Mar 19, 2024

I have been using q8_0 for the K part of the cache for a long time now without any issues.

llama_new_context_with_model: KV self size = 980.00 MiB, K (q8_0): 340.00 MiB, V (f16): 640.00 MiB
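Those numbers line up with the storage math: q8_0 stores each block of 32 values in 34 bytes (32 int8 codes plus an f16 scale), versus 64 bytes for f16, so a 640 MiB f16 K cache shrinks to 640 × 34/64 = 340 MiB.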

@ikawrakow
Contributor

ikawrakow commented Mar 20, 2024

@sorasoras

To me it looks like the quantized cache code needs more attention from the project maintainers, rather than further quantization improvements:

  • Yes, we can have K quantized with Q4_0, Q4_1, Q5_0, Q5_1, or Q8_0, but not V (attempts to use a quantized V cache lead to an assert in ggml_cuda_cpy_tensor_2d).
  • Using a quantized K cache leads to a significant drop in inference speed (from 130 t/s to 76 t/s on my RTX 4080). From a quick look, the implementation seems far from optimal.
  • Using a quantized K cache other than Q8_0 results in a significant PPL increase. I personally have a hard time believing that a KV cache quantized with 2 bits, as stipulated by this issue and the quoted paper, will result in meaningful generation quality.
  • Using more sophisticated quantization techniques, which require significantly more CPU/GPU cycles, will be even more disastrous for performance (at least within the current quantized cache implementation). I did a quick test with IQ4_NL (it seems the block size needs to be 32, so IQ4_NL is the only non-legacy quantization type that can be used; see the sketch after this list). I see performance dropping even further, to 62 t/s. PPL improves compared to Q4_0, but not compared to Q4_1, so the only thing we gained is a ~17% reduction in the size of the K cache.
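To illustrate the block-size point from the last item above: one plausible reading of the constraint is that the quantization block size has to divide the per-head dimension (typically 128 for Llama-style models), which rules out the 256-element super-block k-quants and leaves only the 32-element-block types. The helper below is a hypothetical sketch, not a ggml function.

```cpp
// Hypothetical sketch of the block-size constraint on quantized K cache types:
// the block size must divide the per-head dimension. Legacy types and IQ4_NL
// use 32-element blocks; k-quants use 256-element super-blocks, which do not
// fit a typical head_dim of 128.
#include <cstdio>

struct CacheType { const char * name; int block_size; };

static bool fits_k_cache(const CacheType & t, int head_dim) {
    return head_dim % t.block_size == 0;
}

int main() {
    const int head_dim = 128; // typical for Llama/Mistral
    const CacheType types[] = {
        { "Q4_0", 32 }, { "Q4_1", 32 }, { "Q8_0", 32 }, { "IQ4_NL", 32 }, { "Q4_K", 256 },
    };
    for (const auto & t : types) {
        std::printf("%-6s block=%3d usable=%s\n", t.name, t.block_size,
                    fits_k_cache(t, head_dim) ? "yes" : "no");
    }
    return 0;
}
```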

@github-actions github-actions bot added the stale label Apr 20, 2024
Contributor

github-actions bot commented May 5, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

@github-actions github-actions bot closed this as completed May 5, 2024
@sorasoras
Author

@ggerganov With FA merged, is there any chance to improve the speed of KV quants so they become useful?
