Add Q4_1_O quantization format that preserves outliers in weights and does dot in FP32 #825

Closed

Conversation

@saharNooby commented Apr 7, 2023

THIS PR WAS OPENED BY MISTAKE; IT WAS MEANT FOR rwkv.cpp

Q4_1_O is like Q4_1, but with two important differences:

  • for each block, a single outlier (the absmax value) is selected and stored separately, as-is; the remaining values are quantized as if there were no outlier at all (see the quantization sketch below)
  • during inference, the dot product in matmul is done in FP32, after dequantizing the weights; in contrast, Q4_1 quantizes the activations and does a quantized dot product (see the dot-product sketch below)
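
A minimal C sketch of how such a block could be quantized (an illustration, not the actual rwkv.cpp code; the block layout, the `QK = 32` block size, and all field names here are assumptions):

```c
#include <math.h>
#include <stdint.h>

#define QK 32  // assumed block size, matching ggml's Q4_1 blocks

// Hypothetical Q4_1_O block layout: the usual Q4_1 fields, plus the
// outlier stored separately and unquantized.
typedef struct {
    float    min;           // minimum of the non-outlier values
    float    d;             // scale for the non-outlier values
    uint16_t outlier_index; // position of the outlier within the block
    float    outlier_value; // the absmax value, kept as-is
    uint8_t  qs[QK / 2];    // 4-bit codes, two per byte
} block_q4_1_o;

static void quantize_block_q4_1_o(const float * x, block_q4_1_o * out) {
    // Pick the single outlier: the value with the largest magnitude.
    int oi = 0;
    for (int i = 1; i < QK; i++) {
        if (fabsf(x[i]) > fabsf(x[oi])) oi = i;
    }
    out->outlier_index = (uint16_t) oi;
    out->outlier_value = x[oi];

    // Compute min/max over the remaining values, as if the outlier
    // were not there at all.
    float min = INFINITY, max = -INFINITY;
    for (int i = 0; i < QK; i++) {
        if (i == oi) continue;
        if (x[i] < min) min = x[i];
        if (x[i] > max) max = x[i];
    }

    // Standard Q4_1-style affine quantization into 16 levels.
    const float d  = (max - min) / 15.0f;
    const float id = d != 0.0f ? 1.0f / d : 0.0f;
    out->min = min;
    out->d   = d;

    for (int i = 0; i < QK; i += 2) {
        // The slot holding the outlier gets a clamped dummy code; it is
        // replaced by outlier_value on dequantization.
        const uint8_t q0 = (uint8_t) fminf(15.0f, fmaxf(0.0f, roundf((x[i]     - min) * id)));
        const uint8_t q1 = (uint8_t) fminf(15.0f, fmaxf(0.0f, roundf((x[i + 1] - min) * id)));
        out->qs[i / 2] = q0 | (q1 << 4);
    }
}
```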

This format greatly improves perplexity compared to Q4_1, but at the cost of inference that is as slow as FP32.
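
The slowdown comes from the second point above: every weight is dequantized to FP32 and the dot product runs on plain floats, so the matmul does as much float work as an FP32 model. A sketch of the per-block dot product, under the same assumed layout as the quantization sketch:

```c
// FP32 dot product over one block: dequantize each weight, restore the
// outlier as-is, then accumulate against FP32 activations y.
// Unlike Q4_1, the activations y are never quantized.
static float dot_block_q4_1_o(const block_q4_1_o * b, const float * y) {
    float sum = 0.0f;
    for (int i = 0; i < QK; i++) {
        const uint8_t byte = b->qs[i / 2];
        const uint8_t q = (i % 2 == 0) ? (byte & 0x0F) : (byte >> 4);
        float w = b->min + b->d * (float) q;
        if (i == (int) b->outlier_index) {
            w = b->outlier_value; // the outlier bypasses quantization entirely
        }
        sum += w * y[i];
    }
    return sum;
}
```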

Perplexity comparison on a private dataset (less is better):

| Model | Loss | Perplexity |
|---|---:|---:|
| 1B5-20220929-ctx4096-Q4_0.bin | 3.079 | 21.745 |
| 1B5-20220929-ctx4096-Q4_1.bin | 2.655 | 14.231 |
| 1B5-20220929-ctx4096-Q4_1_O.bin | 2.204 | 9.060 |
| 1B5-20220929-ctx4096-FP16.bin | 2.060 | 7.847 |
| 3B-20221110-ctx4096-Q4_0.bin | 4.689 | 108.724 |
| 3B-20221110-ctx4096-Q4_1.bin | 2.916 | 18.475 |
| 3B-20221110-ctx4096-Q4_1_O.bin | 2.406 | 11.093 |
| 3B-20221110-ctx4096-FP16.bin | 2.067 | 7.901 |

Performance comparison (per-token latency, less is better):

| Model | Format | Per-token latency |
|---|---|---:|
| 1B5 | FP32 | 213 ms |
| 1B5 | FP16 | 115 ms |
| 1B5 | Q4_0 | 159 ms |
| 1B5 | Q4_1 | 110 ms |
| 1B5 | Q4_1_O | 207 ms |

@saharNooby closed this Apr 7, 2023
@saharNooby (Author) commented

Oops, very large mistake -- this PR was meant for rwkv.cpp, sorry!
