Hey, wanted to know if we could possibly integrate SpQR quantisation, as described in this paper: https://arxiv.org/abs/2306.03078.
SpQR works by identifying and isolating outlier weights, which cause particularly large quantization errors, and storing them in higher precision, while compressing all other weights to 3-4 bits. The paper reports relative accuracy losses of less than 1% in perplexity for LLaMA and Falcon LLMs. This makes it possible to run a 33B-parameter LLM on a single 24 GB consumer GPU without performance degradation, and at a ~15% speedup, making powerful LLMs available to consumers without any downsides.
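For anyone skimming the paper, here is a minimal NumPy sketch of the core outlier-isolation idea. To be clear, this is not SpQR's actual algorithm (the paper detects outliers by their quantization sensitivity and quantizes the rest in small groups with bilevel-quantized scales); the magnitude-based threshold, function name, and parameters below are my own illustration:

```python
import numpy as np

def quantize_with_outliers(w, bits=3, outlier_frac=0.01):
    """Toy sketch: keep the largest-magnitude weights in fp32 and
    round-to-nearest quantize everything else to `bits` bits.
    Magnitude is only a stand-in for SpQR's sensitivity criterion."""
    w = np.asarray(w, dtype=np.float32)

    # Pick the top `outlier_frac` weights by magnitude as "outliers".
    thresh = np.quantile(np.abs(w), 1.0 - outlier_frac)
    outlier_mask = np.abs(w) >= thresh

    # Symmetric round-to-nearest quantization of the remaining weights.
    base = np.where(outlier_mask, 0.0, w)
    qmax = 2 ** (bits - 1) - 1            # e.g. 3 for signed 3-bit
    scale = float(np.abs(base).max()) / qmax
    if scale == 0.0:
        scale = 1.0
    q = np.clip(np.round(base / scale), -qmax - 1, qmax)

    # Dequantize and splice the fp32 outliers back in.
    deq = (q * scale).astype(np.float32)
    deq[outlier_mask] = w[outlier_mask]
    return deq, outlier_mask

rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)
w[::512] *= 20.0                          # plant a few artificial outliers
deq, mask = quantize_with_outliers(w)
print(f"outliers kept in fp32: {mask.sum()} / {w.size}")
print(f"max error on quantized weights: {np.abs(deq - w)[~mask].max():.4f}")
```

In the real thing the outliers wouldn't be spliced back densely like this; the paper stores them in a sparse CSR-style structure so only the 3-4 bit base weights pay the dense memory cost.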
This would keep all the benefits of quantising while not losing performance.
Apologies if it has already been implemented or is already in progress. Please point me to the PR if so. This repo gets work done so fast!
nivibilla changed the title from "[User] Insert summary of your issue or enhancement.." to "[Feature Request] SpQR quantisation" on Jun 11, 2023
I also think the best way for LLMs to run economically is to fit them on the GPU at hand (or within X% of it when using partial GPU offload) with dynamically adjusted precision: the most important weights receive the most precision, as much as the VRAM constraints allow. A rough sketch of the idea is below.
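Purely as an illustration of that idea, not anything from this repo: a greedy allocator that starts every tensor at the narrowest type and promotes the most important ones to wider types while a VRAM budget allows. The tensor names, sizes, and importance scores are made up:

```python
def allocate_bits(tensors, budget_bytes, widths=(2, 3, 4, 8)):
    """tensors: list of (name, n_params, importance); higher importance
    means the tensor should keep more precision."""
    plan = {name: widths[0] for name, _, _ in tensors}
    size = sum(n * widths[0] // 8 for _, n, _ in tensors)
    # Visit tensors from most to least important; bump each one to the
    # widest bit width that still fits inside the budget.
    for name, n, _ in sorted(tensors, key=lambda t: -t[2]):
        for w in reversed(widths[1:]):        # try 8, then 4, then 3 bits
            extra = n * (w - plan[name]) // 8
            if size + extra <= budget_bytes:
                size += extra
                plan[name] = w
                break
    return plan, size

# Made-up layer sizes and importances for a single transformer block.
tensors = [
    ("attn.q", 4096 * 4096, 0.9),
    ("mlp.up", 4096 * 11008, 0.5),
    ("mlp.down", 11008 * 4096, 0.4),
]
plan, size = allocate_bits(tensors, budget_bytes=40_000_000)
print(plan, f"-> {size / 1e6:.1f} MB")
```

A real scheme would need an actual importance measure (e.g. something activation- or Hessian-based, as quantization papers tend to use) rather than hand-assigned scores.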