[ Docs ] Conceptual Guides #18
base: main
Conversation
## Theory
Performing quantization to go from `float16` to `int8` (or lower) is tricky. Only 256 values can be represented in `int8`, while `float16` can represent a very wide range of values. The idea is to find the best way to project our range `[a, b]` of `float16` values onto the `int8` space.
good space for a diagram (bucketing the weight distribution to 256 buckets)
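To make the projection concrete, here is a minimal sketch of the standard affine mapping from an observed range `[a, b]` to `int8`. The function names and the NumPy-based setup are illustrative, not this repository's API:

```python
import numpy as np

def quantize_affine(x: np.ndarray, a: float, b: float) -> tuple[np.ndarray, float, int]:
    """Project values from the observed range [a, b] onto int8 ([-128, 127])."""
    qmin, qmax = -128, 127
    scale = (b - a) / (qmax - qmin)            # width of one int8 "bucket"
    zero_point = int(round(qmin - a / scale))  # int8 value that represents 0.0
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize_affine(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Recover an approximation of the original values."""
    return (q.astype(np.float32) - zero_point) * scale

x = np.random.uniform(-2.0, 6.0, size=8).astype(np.float32)
q, scale, zp = quantize_affine(x, a=-2.0, b=6.0)
x_hat = dequantize_affine(q, scale, zp)  # close to x, up to scale/2 rounding error
```

Because every `float16` value in a bucket of width `scale` collapses to a single `int8` value, the round-trip error is bounded by `scale / 2`; choosing the range `[a, b]` well is what keeps `scale`, and therefore the error, small.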
* **Static quantization**: the range for each activation is computed in advance at quantization-time, typically by passing representative "calibration" data through the model and recording the activation values. In practice, this means running a number of forward passes on a calibration dataset and computing the ranges from the observed calibration data, as in the sketch below.
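A minimal sketch of what calibration looks like; the `MinMaxObserver` class, the toy `forward` layer, and the random calibration data are all hypothetical stand-ins, not this repository's implementation:

```python
import numpy as np

class MinMaxObserver:
    """Hypothetical observer that records the running min/max of what it sees."""
    def __init__(self):
        self.min_val, self.max_val = np.inf, -np.inf

    def update(self, x: np.ndarray) -> None:
        self.min_val = min(self.min_val, float(x.min()))
        self.max_val = max(self.max_val, float(x.max()))

# Stand-ins for a real model and calibration set (illustrative only).
weights = np.random.randn(16, 16).astype(np.float32)
def forward(batch: np.ndarray) -> np.ndarray:
    return np.maximum(batch @ weights, 0.0)  # a toy layer with ReLU activations

calibration_data = [np.random.randn(4, 16).astype(np.float32) for _ in range(32)]

# Calibration: run forward passes and record the observed activation range.
observer = MinMaxObserver()
for batch in calibration_data:
    observer.update(forward(batch))

# The frozen [min, max] range then fixes the activation's scale and zero point.
a, b = observer.min_val, observer.max_val
```

Once calibration finishes, the observed range is frozen and reused for every inference, which is what makes the quantization "static".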
In general, it is best practice to start your experiments with:
why?
Why is it best practice?
Co-authored-by: Benjamin Fineran <bfineran@users.noreply.github.com>
* test forward (vllm-project#16)
* test frozen (vllm-project#17)
* test frozen
* rename
* lifecycle conftest (vllm-project#21)
* test initalize (vllm-project#18)
* test initalize
* newline
* parametrize weights and inp_act
* remove dup
* test lifecycle (vllm-project#19)
* test lifecycle
* comments
* comments
* add quantization test
* Lifecycle/min max obs (vllm-project#20)
* min max test
* add minmax obs
* test scale range and min_max update
* rebase
* rebase
* fix
* fix
SUMMARY:
TEST PLAN: