INT8 support - SmoothQuant #71

casper-hansen · 2023-09-25T20:10:44Z

Background

Throughput measurements show that INT4-quantized weights fall behind FP16 as data parallelism increases. The cause is dequantization overhead, which makes inference compute-bound. vLLM PR #1032 documents the overall INT4 throughput as 33% lower.

Introducing INT8: Instead of being compute-bound and performing worse than FP16 on throughput, we can use the INT8 Tensor Cores available on modern GPUs. With vLLM PR #1112, you should be able to achieve 20% higher throughput.

SmoothQuant

We adapt SmoothQuant to AWQ, taking only the bare minimum needed to enable it. The original AWQ loss function is used to find the optimal scales for the weights, while the SmoothQuant method finds the optimal scales for the inputs.

SmoothQuant quantizes the inputs to linear layers to INT8. It first smooths the input channels whose values are very large compared to the rest, i.e. it tames outliers by shifting part of their magnitude into the weights. Additionally, SmoothQuant uses neither group_size nor zero_point quantization. This is convenient because it lets us use the CUTLASS INT8 kernels for fast inference, so we do not have to write our own GEMM kernels.
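As a rough illustration of the idea above, here is a minimal pure-Python sketch of smoothing followed by symmetric per-tensor INT8 quantization (one scale, no zero point, no groups). It is not the code in this PR: the function names (`smooth_scales`, `quantize_per_tensor`, `w8a8_linear`), the `alpha=0.5` default, and the toy data are assumptions made for the example, and the plain-Python `matmul` only stands in for the CUTLASS INT8 GEMM.

```python
def smooth_scales(X, W, alpha=0.5, eps=1e-8):
    """Per-input-channel factors s_j = max|X_j|^alpha / max|W_j|^(1 - alpha).

    X has shape (tokens, in_features); W has shape (in_features, out_features).
    """
    scales = []
    for j in range(len(W)):
        act_max = max(abs(row[j]) for row in X)
        w_max = max(abs(w) for w in W[j])
        scales.append(max(act_max, eps) ** alpha / max(w_max, eps) ** (1 - alpha))
    return scales


def quantize_per_tensor(T):
    """Symmetric per-tensor INT8: a single scale, no zero point, no groups."""
    t_max = max(abs(v) for row in T for v in row)
    scale = t_max / 127.0
    q = [[max(-127, min(127, round(v / scale))) for v in row] for row in T]
    return q, scale


def matmul(A, B):
    """Naive matrix product; stands in for an INT8 GEMM kernel."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def w8a8_linear(X, W, alpha=0.5):
    """Smooth, quantize inputs and weights to INT8, multiply, then rescale."""
    s = smooth_scales(X, W, alpha)
    X_s = [[x / s[j] for j, x in enumerate(row)] for row in X]  # X / s
    W_s = [[w * s[j] for w in W[j]] for j in range(len(W))]     # s * W
    Xq, sx = quantize_per_tensor(X_s)
    Wq, sw = quantize_per_tensor(W_s)
    acc = matmul(Xq, Wq)  # integer accumulation
    return [[v * sx * sw for v in row] for row in acc]


# Toy input whose second channel is an outlier.
X = [[0.1, 20.0, -0.2], [0.3, -18.0, 0.1]]
W = [[0.5, -0.3], [0.02, 0.01], [-0.4, 0.6]]
print("FP reference:", matmul(X, W))
print("W8A8 result: ", w8a8_linear(X, W))
```

Because the smoothing divides the inputs and multiplies the weights by the same per-channel factor, the product is unchanged before quantization; only the quantization error is affected.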

Paper: https://arxiv.org/pdf/2211.10438.pdf
torch-int (adapted): https://github.com/casper-hansen/torch-int
SmoothQuant (adapted): https://github.com/AniZpZ/smoothquant/tree/llama-dev

TODO

Quantizing LLaMa models to INT8
Loading INT8 weights
Enabling fused layers (maybe later)
Sanity check: Make sure input scales are applied, saved, and loaded correctly.
Sanity check: Make sure that we can quantize to INT8 with existing methods.
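The first sanity check in the list above could look roughly like the following sketch. It is an assumption-heavy illustration, not the PR's test: the scales are stored in a JSON file rather than in the model's buffers, and the helper names (`save_scales`, `load_scales`, `quantize_input`, `check_input_scales`) and the layer names are hypothetical.

```python
import json
import os
import tempfile


def quantize_input(x, input_scale):
    """Apply a per-tensor input scale: round(x / scale), clamped to INT8."""
    return [max(-127, min(127, round(v / input_scale))) for v in x]


def save_scales(path, scales):
    with open(path, "w") as f:
        json.dump(scales, f)


def load_scales(path):
    with open(path) as f:
        return json.load(f)


def check_input_scales(scales, sample):
    """Check that scales survive a save/load round trip and are applied."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "input_scales.json")
        save_scales(path, scales)
        loaded = load_scales(path)

    assert loaded.keys() == scales.keys(), "missing or extra layers"
    for name, scale in scales.items():
        assert loaded[name] == scale, f"scale changed for {name}"
        q = quantize_input(sample, loaded[name])
        # For values inside the representable range, the dequantized value
        # should be within half a quantization step of the original.
        for orig, qv in zip(sample, q):
            if abs(orig) <= 127 * scale:
                assert abs(orig - qv * scale) <= scale / 2 + 1e-9
    return True


# Hypothetical layer names and scales, for illustration only.
scales = {"layers.0.q_proj": 0.031, "layers.0.o_proj": 0.0125}
print(check_input_scales(scales, [0.5, -1.2, 0.04, 1.5]))
```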

casper-hansen · 2023-10-09T13:56:13Z

Closing in favor of #98

casper-hansen added 30 commits September 21, 2023 15:35

Added torch-int installation script

d591dec

Import torch int modules

6897d57

Add int8 example (WIP)

1a65690

Add cutlass submodule

22d80cf

Add instruction for recursing submodules

bfb3940

Install CUTLASS script

21bc24c

Add torch int modules

ba97a70

Install cutlass script update

e418553

Initial CUDA extension

3531476

Change to CUTLASS 2.11

49d91bd

Remove cutlass submodule

1cac4d4

Remove cutlass references

b8dd7ea

Update install cutlass script

4b4441e

Initial WQLinear_INT8

41b2f82

Refactor module

df23c24

Simplify logic

81a5098

Align with other WQ modules

bbf5fcf

Load WQ modules

1e64295

input scale is None by default

bb55325

Define layers that need FP16/FP32 -> INT8 conversion of inputs

7582230

Correct import and check if inputs need to be quantized

acbb712

Use WQLinear_INT8

f50f43b

Add comments

3e64322

Add RMSNorm INT8

4f8ef68

Create activation collector

e5fae7e

Run generation of activation scales outside main loop

342cd08

Clean up INT8 linear

6990609

Disable group size on INT8

1a4193e

Enable quantization without zero_point

ed6fcf4

Disable zero point

00f0ca4

casper-hansen added 4 commits September 25, 2023 16:46

Remove unused code

5f997a5

Simplify forward method

764aeb6

Remove unused imports

fa2c16a

Quantizing works

1694db4

casper-hansen mentioned this pull request Sep 26, 2023

INT8 quantization support #45

Open

casper-hansen added 9 commits September 27, 2023 13:35

Remove cutlass repo after install

cca59a7

Fix scale input names

1506107

Speed up loading

2dd082e

Switch to tinyllama for example

79a0841

Initialize INT8 correctly

6ad0cfb

Fix warning

55be99c

Compute and use mean scale as input_scale

bcb576d

Align buffers with original smoothquant

90b9fa3

RMSNorm INT8

590dfa0

casper-hansen mentioned this pull request Sep 27, 2023

Support int8 KVCacheQuant and W8A8 inference in vllm vllm-project/vllm#1112

Closed

6 tasks

casper-hansen added 11 commits September 27, 2023 21:56

Merge branch 'main' into smoothquant

f9c5e12

Create CustomLinear

5910de1

Fix typing

e2e9718

Loading works, generating works but garbage output

f518637

disable_fused argument

11055ba

Remove beta

6d1a14b

Give every layer an input scale

d4833d7

Give every layer an input scale

283d686

Simplify

024fac9

torch.no_grad

b13e8ad

Initial per-tensor support in pseudo_quantize_tensor

9ad9788

casper-hansen closed this Oct 9, 2023

casper-hansen deleted the smoothquant branch January 21, 2024 20:44

Anindyadeep mentioned this pull request Apr 15, 2024

Mistral AWQ with memory profiling premAI-io/benchmarks#161

Merged
