
INT8 support - SmoothQuant #71

Closed · wants to merge 56 commits

Conversation

casper-hansen (Owner) commented on Sep 25, 2023

Background

When measuring throughput, it has become apparent that INT4 quantized weights fall behind FP16 as you increase data parallelism. This is due to the overhead of dequantization, which makes the kernels compute-bound. In vLLM PR 1032, the overall throughput of INT4 was documented to be 33% lower than FP16.

Introducing INT8: instead of being compute-bound and performing worse than FP16 on throughput, we can make use of the INT8 Tensor Cores that modern GPUs provide. Using vLLM PR #1112, you will be able to achieve 20% higher throughput.
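As a rough sketch of what W8A8 buys us (this emulates the idea in plain PyTorch and is not the PR's CUTLASS kernels): both the weights and the activations are quantized to INT8, the matmul accumulates in integers, and a single rescale produces the floating-point output, so there is no per-token weight dequantization inside the matmul.

```python
import torch

def quantize_per_tensor_int8(t: torch.Tensor):
    # Symmetric per-tensor INT8 quantization: one FP scale, no zero point.
    scale = t.abs().max().clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(t / scale), -128, 127).to(torch.int8)
    return q, scale

def w8a8_linear(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    x_q, x_scale = quantize_per_tensor_int8(x)
    w_q, w_scale = quantize_per_tensor_int8(w)
    # Real INT8 Tensor Core kernels accumulate in INT32; we emulate the
    # integer matmul in int64 on CPU for simplicity.
    acc = x_q.long() @ w_q.t().long()
    # One rescale maps the integer accumulator back to floating point.
    return acc.to(torch.float32) * (x_scale * w_scale)

x = torch.randn(4, 512)     # activations
w = torch.randn(1024, 512)  # weight of a Linear(512 -> 1024)
print(w8a8_linear(x, w).shape)  # torch.Size([4, 1024])
```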

SmoothQuant

We adapt SmoothQuant to AWQ, taking only the bare minimum necessary to enable it. We use the original AWQ loss function to find the optimal scales for the weights, while letting the SmoothQuant method find the optimal scales for the inputs.

SmoothQuant quantizes the inputs to linear layers to INT8 by smoothing input channels that have very large values compared to the rest, i.e. it smooths the activation outliers into the weights. Additionally, SmoothQuant uses neither group_size nor zero_point quantization. This is clever because it enables us to use the CUTLASS INT8 kernels for fast inference, so we do not have to write our own GEMM kernels.
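As an illustration of the core transform (a minimal sketch, not the PR's implementation): SmoothQuant computes a per-input-channel factor s_j = max|X_j|^alpha / max|W_j|^(1-alpha), divides the activations by s (folded into the preceding op) and multiplies the weights by s, so the product is unchanged but the activation outliers shrink before INT8 quantization. The function name and calibration input below are illustrative.

```python
import torch

@torch.no_grad()
def smooth_linear(linear: torch.nn.Linear, act_absmax: torch.Tensor, alpha: float = 0.5):
    # act_absmax: per-input-channel max |X_j| collected on calibration data.
    # linear.weight has shape [out_features, in_features].
    w_absmax = linear.weight.abs().max(dim=0).values.clamp(min=1e-5)
    # s_j = max|X_j|^alpha / max|W_j|^(1-alpha); alpha balances how much
    # quantization difficulty is migrated from activations to weights.
    s = (act_absmax.clamp(min=1e-5).pow(alpha) / w_absmax.pow(1.0 - alpha)).clamp(min=1e-5)
    # W' = W * diag(s); the matching X' = X / s is folded into the preceding
    # LayerNorm (or previous linear), so inference cost does not change.
    linear.weight.mul_(s)
    return s  # the caller folds 1/s into the producer of the activations
```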

Paper: https://arxiv.org/pdf/2211.10438.pdf
torch-int (adapted): https://github.com/casper-hansen/torch-int
SmoothQuant (adapted): https://github.com/AniZpZ/smoothquant/tree/llama-dev

TODO

  • Quantizing LLaMa models to INT8
  • Loading INT8 weights
  • Enabling fused layers (maybe later)
  • Sanity check: Make sure input scales are applied, saved, and loaded correctly (see the sketch after this list).
  • Sanity check: Make sure that we can quantize to INT8 with existing methods.
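For the input-scales sanity check, a hypothetical round-trip test could look like the following (the file name and dictionary layout are illustrative, not the PR's actual on-disk format):

```python
import torch

def save_input_scales(scales, path="input_scales.pt"):
    # scales: {layer_name: per-channel input scale tensor}
    torch.save({name: s.cpu() for name, s in scales.items()}, path)

def check_input_scales(scales, path="input_scales.pt"):
    loaded = torch.load(path)
    assert loaded.keys() == scales.keys(), "missing or extra layers after reload"
    for name, s in scales.items():
        assert torch.allclose(loaded[name], s.cpu()), f"scale mismatch in {name}"
```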

casper-hansen (Owner, Author) commented:
Closing in favor of #98
