Hey, wanted to know if we could possibly integrate SpQR quantisation, as described in this paper: https://arxiv.org/abs/2306.03078.
SpQR works by identifying and isolating outlier weights, which cause particularly large quantization errors, and storing them in higher precision, while compressing all other weights to 3-4 bits. The paper reports relative accuracy losses of less than 1% in perplexity for LLaMA and Falcon LLMs. This makes it possible to run a 33B-parameter LLM on a single 24 GB consumer GPU without performance degradation, and at a ~15% speedup, making powerful LLMs available to consumers without any downsides.
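For anyone skimming the paper, here is a minimal NumPy sketch of the core outlier-isolation idea. To be clear, this is not SpQR's actual algorithm (the paper detects outliers by their quantization sensitivity and quantizes the rest in small groups with bilevel-quantized scales); the magnitude-based threshold, function name, and parameters below are my own illustration:

```python
import numpy as np

def quantize_with_outliers(w, bits=3, outlier_frac=0.01):
    """Toy sketch: keep the largest-magnitude weights in fp32 and
    round-to-nearest quantize everything else to `bits` bits.
    Magnitude is only a stand-in for SpQR's sensitivity criterion."""
    w = np.asarray(w, dtype=np.float32)

    # Pick the top `outlier_frac` weights by magnitude as "outliers".
    thresh = np.quantile(np.abs(w), 1.0 - outlier_frac)
    outlier_mask = np.abs(w) >= thresh

    # Symmetric round-to-nearest quantization of the remaining weights.
    base = np.where(outlier_mask, 0.0, w)
    qmax = 2 ** (bits - 1) - 1            # e.g. 3 for signed 3-bit
    scale = float(np.abs(base).max()) / qmax
    if scale == 0.0:
        scale = 1.0
    q = np.clip(np.round(base / scale), -qmax - 1, qmax)

    # Dequantize and splice the fp32 outliers back in.
    deq = (q * scale).astype(np.float32)
    deq[outlier_mask] = w[outlier_mask]
    return deq, outlier_mask

rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)
w[::512] *= 20.0                          # plant a few artificial outliers
deq, mask = quantize_with_outliers(w)
print(f"outliers kept in fp32: {mask.sum()} / {w.size}")
print(f"max error on quantized weights: {np.abs(deq - w)[~mask].max():.4f}")
```

In the real thing the outliers wouldn't be spliced back densely like this; the paper stores them in a sparse CSR-style structure so only the 3-4 bit base weights pay the dense memory cost.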
This would keep all the benefits of quantising while not losing performance.
Apologies if it has already been implemented or is already in progress. Please point me to the PR if so. This repo gets work done so fast!
nivibilla changed the title from "[User] Insert summary of your issue or enhancement.." to "[Feature Request] SpQR quantisation" on Jun 11, 2023
I also think the best way for LLMs to run economically is to fit them on the GPU at hand (or within X% of it when using partial GPU offload) with dynamically adjusted precision: the most important weights receive the most precision, as much as the VRAM constraints allow. A rough sketch of the idea is below.
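Purely as an illustration of that idea, not anything from this repo: a greedy allocator that starts every tensor at the narrowest type and promotes the most important ones to wider types while a VRAM budget allows. The tensor names, sizes, and importance scores are made up:

```python
def allocate_bits(tensors, budget_bytes, widths=(2, 3, 4, 8)):
    """tensors: list of (name, n_params, importance); higher importance
    means the tensor should keep more precision."""
    plan = {name: widths[0] for name, _, _ in tensors}
    size = sum(n * widths[0] // 8 for _, n, _ in tensors)
    # Visit tensors from most to least important; bump each one to the
    # widest bit width that still fits inside the budget.
    for name, n, _ in sorted(tensors, key=lambda t: -t[2]):
        for w in reversed(widths[1:]):        # try 8, then 4, then 3 bits
            extra = n * (w - plan[name]) // 8
            if size + extra <= budget_bytes:
                size += extra
                plan[name] = w
                break
    return plan, size

# Made-up layer sizes and importances for a single transformer block.
tensors = [
    ("attn.q", 4096 * 4096, 0.9),
    ("mlp.up", 4096 * 11008, 0.5),
    ("mlp.down", 11008 * 4096, 0.4),
]
plan, size = allocate_bits(tensors, budget_bytes=40_000_000)
print(plan, f"-> {size / 1e6:.1f} MB")
```

A real scheme would need an actual importance measure (e.g. something activation- or Hessian-based, as quantization papers tend to use) rather than hand-assigned scores.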