
Add FP8 KV Cache quant example #113

Merged
mgoin merged 2 commits into main from kv-cache-fp8-example on Aug 27, 2024
Conversation

mgoin (Member) commented on Aug 26, 2024

FIX #111

mgoin merged commit ac673b5 into main on Aug 27, 2024
4 of 7 checks passed
mgoin deleted the kv-cache-fp8-example branch on August 27, 2024 at 23:55
kylesayrs pushed a commit that referenced this pull request Aug 28, 2024
* Add example for quantization kv cache to fp8

* Add eval
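
The commit above adds an example script for quantizing a model's KV cache to FP8 using llm-compressor's one-shot flow. The snippet below is a minimal, hypothetical sketch of that flow, not the exact file merged in this PR: the model ID, calibration dataset, sample counts, and recipe keys are assumptions for illustration, and the merged example in the repository is authoritative.

```python
# Hypothetical sketch: quantize the KV cache to FP8 with llm-compressor's
# oneshot API, then save the compressed checkpoint. Model, dataset, and
# recipe values below are placeholders, not the PR's exact example.
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.transformers import oneshot

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed example model

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Recipe: attach an 8-bit float (FP8) scheme to the KV cache only;
# weights and activations are left untouched in this sketch.
recipe = """
quant_stage:
    quant_modifiers:
        QuantizationModifier:
            kv_cache_scheme:
                num_bits: 8
                type: float
                strategy: tensor
                dynamic: false
                symmetric: true
"""

# One-shot calibration pass; dataset name and sample count are placeholders.
oneshot(
    model=model,
    dataset="ultrachat-200k",
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

# Save the quantized checkpoint for downstream serving (e.g. with vLLM).
SAVE_DIR = MODEL_ID.split("/")[-1] + "-FP8-KV"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```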
markmc pushed a commit to markmc/llm-compressor that referenced this pull request Nov 13, 2024
* compute zp, scale if weight exists in module

* WIP, gets through 1 forward pass

* fix for zeroed out scales

* fix model load

* style

* offload helper fns

* pass tests

* add test to check that observers are used to populate zp and scale in initialization

* fix no calibration case

* clean up for PR

* fix test

* update dependencies

* fix forward bug

* don't calibrate on weights

* dont calib weight in forward

* fix zp load

* check calibration

---------

Co-authored-by: George Ohashi <george@neuralmagic.com>
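
The commit bullets above mention computing zero point and scale when a weight exists in the module and verifying that observers populate them at initialization. As a rough illustration only (not compressed-tensors' actual observer implementation), a min-max observer can derive those values from a tensor like this:

```python
# Toy min-max observer sketch: derive a quantization scale and zero point
# from a tensor. Illustrative only; the real observers also handle
# per-channel strategies, running statistics, and offloaded tensors.
import torch

def minmax_scale_zero_point(x: torch.Tensor, num_bits: int = 8, symmetric: bool = True):
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    if symmetric:
        # Symmetric: scale from the max magnitude, zero point pinned at 0.
        scale = x.abs().max().clamp(min=1e-8) / qmax
        zero_point = torch.zeros_like(scale, dtype=torch.int64)
    else:
        # Asymmetric: spread the observed [min, max] range over the integer grid.
        x_min, x_max = x.min(), x.max()
        scale = ((x_max - x_min) / (qmax - qmin)).clamp(min=1e-8)
        zero_point = torch.round(qmin - x_min / scale).to(torch.int64)
    return scale, zero_point

# Example: populate scale/zero point from a module's weight if it exists,
# echoing the "compute zp, scale if weight exists in module" commit above.
linear = torch.nn.Linear(16, 16)
if linear.weight is not None:
    scale, zp = minmax_scale_zero_point(linear.weight.detach())
```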
Development

Successfully merging this pull request may close these issues.

[Usage] How to do KV cache quantization? #111