-
I read the paper, but it's still not clear to me whether L²QER changes anything at inference time or not. Please correct me if I'm wrong, but my understanding is that we first do LQER by running SVD on the quantization error to get the LoRA, and then further optimize it via L²QER (around page 6 of the paper). Anyway,
FYI, someone has already done a Python implementation of the GGML quants in the HF transformers library. Hopefully this can help your implementation.
I'm really interested in this, because control vectors could benefit from SVD (we're currently using PCA).
Yes, I think it would be useful to know if we should call The current
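FWIW my reading is the same and, as far as I can tell, nothing changes at inference time: both LQER and L²QER produce a static low-rank pair (A, B) that is applied next to the quantized weight, and only how the SVD is computed at quantization time differs. A minimal sketch of the forward pass under that assumption (the names `W_q`, `lora_A`, `lora_B` are illustrative, not from the paper or llama.cpp):

```python
import torch

def forward_with_error_correction(x: torch.Tensor,
                                  W_q: torch.Tensor,     # dequantized weight, [out, in]
                                  lora_A: torch.Tensor,  # [rank, in]
                                  lora_B: torch.Tensor   # [out, rank]
                                  ) -> torch.Tensor:
    # Identical whether (A, B) came from LQER or L2QER:
    # y = x @ W_q^T + (x @ A^T) @ B^T  ~=  x @ W^T
    return x @ W_q.T + (x @ lora_A.T) @ lora_B.T
```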
-
If it's any use, I tidied up the Mergekit code to extract LoRAs from fine-tuned models. It works on all but the weirdest models now (eg: ...). Might be useful to make a reference version using PyTorch and then compare it with a native C/C++ reimplementation. EDIT: The only part of Mergekit it's using is the "lazy tensor loading"; other than that it's mostly based on: https://github.com/thomasgauthier/LoRD (I'm not sure how it ended up part of Mergekit)
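The core of the extraction (as in LoRD) is just a truncated SVD of the weight delta between the fine-tuned and base models; a rough PyTorch sketch of the idea (tensor names and the fixed rank are illustrative):

```python
import torch

def extract_lora(W_base: torch.Tensor, W_ft: torch.Tensor, rank: int):
    """Approximate W_ft - W_base with a rank-`rank` product lora_B @ lora_A."""
    delta = (W_ft - W_base).float()               # [out, in]
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
    sqrt_S = torch.sqrt(S[:rank])
    lora_B = U[:, :rank] * sqrt_S                 # [out, rank]
    lora_A = sqrt_S.unsqueeze(1) * Vh[:rank]      # [rank, in]
    return lora_A, lora_B
```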
-
Interesting work, it will be cool to try to implement this approach and see how the perplexity improves for different ranks. I'm looking at some of the results in the paper and I'm not sure how to interpret Appendix B: based on the graph, it seems L2QER performs worse (i.e. higher error) compared to LQER, while the text states the opposite. Am I reading it wrong?
-
So I'm just experimenting with this now, but so far have done the opposite of

The biggest hurdle in

The GNU Scientific Library mostly implements all this old, robust Fortran code: https://www.gnu.org/software/gsl/doc/html/linalg.html (for dense BLAS - the sparse BLAS stuff we don't really care about much anyway).

Obviously this isn't much use on its own, as everything will be CPU-based and not actually use the GGML back-ends... BUT: it actually only relies on an implementation of "CBLAS", which it provides itself here: https://git.savannah.gnu.org/cgit/gsl.git/tree/cblas but it can actually use any implementation, such as the one in MKL (it's a bit of a PITA to link though): https://stackoverflow.com/questions/52989133/linking-gsl-c-program-with-intel-mkl and I have seen people also link with the Netlib "wrapper for legacy Fortran" successfully.

So to use this with the GGML back-ends, all you would actually need to do is implement these level 1/2/3 CBLAS functions: https://git.savannah.gnu.org/cgit/gsl.git/tree/cblas/gsl_cblas.h and then you would get the full power of the GGML back-ends, but with the carefully curated set of linear algebra algorithms!

You could also skip all the "C" (complex) and "Z" (double complex) functions and many of the matrix types for now. If you look at the source folder: https://git.savannah.gnu.org/cgit/gsl.git/tree/cblas then it's clearly not a huge amount of code to write, and most of it is just pure boilerplate (the tests and the main

It's actually quite likely that in practice you could start with the pure-CPU version of GSL's provided BLAS implementation and convert it bit by bit to use the GGML code.

@ggerganov What is your opinion on the idea of having the GSL dependency in

https://en.wikipedia.org/wiki/Comparison_of_linear_algebra_libraries

I should make it clear that there are two things here: We would only need to implement the
-
The only other viable option I can see is to use: https://www.boost.org/doc/libs/1_87_0/libs/python/doc/html/index.html which would then let us call the PyTorch linear algebra code (I think the C++ Torch API is pretty much dead now AFAIK?), but then this brings in massive bloat and long compile times, and I'm pretty sure this won't be wanted as a dependency...
-
```python
import numpy as np
import torch
import gguf

def ggml_quantize_residual(tensor: torch.Tensor, quant_type: gguf.GGMLQuantizationType) -> torch.Tensor:
    """
    Returns the residual between the original tensor and its quantized-dequantized version.

    Args:
        tensor: Input torch tensor
        quant_type: GGML quantization type to use

    Returns:
        Residual tensor (original - reconstructed) on the same device as the input
    """
    # Save the original device and move to CPU for numpy conversion
    orig_device = tensor.device
    cpu_tensor = tensor.cpu()

    # Convert to numpy
    np_tensor = cpu_tensor.numpy()

    # Quantize and dequantize
    if quant_type in [gguf.GGMLQuantizationType.F32, gguf.GGMLQuantizationType.F16]:
        dtype = np.float32 if quant_type == gguf.GGMLQuantizationType.F32 else np.float16
        quant = np_tensor.astype(dtype)
        dequant = quant.astype(np.float32)
    else:
        quant = gguf.quants.quantize(np_tensor, quant_type)
        dequant = gguf.quants.dequantize(quant, quant_type)

    # Convert the dequantized tensor back to torch
    reconstructed = torch.from_numpy(dequant).to(orig_device)

    # Calculate and return the residual
    return tensor - reconstructed


def svd_compress_residual(tensor: torch.Tensor, rank: int):
    """
    SVD compression for the quantized-dequantized residual.
    Returns LoRA matrices (A and B).
    """
    assert tensor.dtype == torch.float32, f"Expected float32 input tensor, got {tensor.dtype}"

    # Compute the truncated SVD (helper defined elsewhere in the script)
    U_r, S_r, Vh_r, var_expl = truncated_svd(tensor, r=rank)

    # Create LoRA matrices by splitting sqrt(S) between the two factors
    sqrtS = torch.sqrt(S_r)
    lora_A = (sqrtS.unsqueeze(1) * Vh_r).contiguous()  # [rank, input_dim]
    lora_B = (U_r * sqrtS.unsqueeze(0)).contiguous()   # [output_dim, rank]

    new_size = rank * (tensor.shape[0] + tensor.shape[1])
    orig_size = tensor.shape[0] * tensor.shape[1]
    compression_ratio = new_size / orig_size

    print(f"- Rank               : {rank}")
    print(f"- Compression Ratio  : {compression_ratio*100:.2f}%")
    print(f"- Variance Explained : {var_expl*100:.2f}%")
    print(f"- LoRA Shapes        : A {lora_A.shape}, B {lora_B.shape}")

    return lora_A, lora_B

# ...

w_ggml_residual = ggml_quantize_residual(w_deq, quant_type)
lora_a, lora_b = svd_compress_residual(w_ggml_residual, rank=rank)

# ...

parser.add_argument("--quant-type", type=str, default="Q4_0", choices=["Q4_0", "Q4_1", "Q5_0", "Q5_1", "Q8_0"],
                    help="Quantization type for GGUF export (default: Q4_0)")
```

Just trying https://github.com/ggerganov/llama.cpp/blob/master/gguf-py/gguf/quants.py (ignore the
-
I'm a little sceptical if this will work. These are the truncated-SVD stats for:

- Rank-256 - adds around 16% extra overhead: (256×(7168 + 2048))/(7168×2048)
- Rank-64 - adds around 4% extra overhead: (64×(7168 + 2048))/(7168×2048)

This distribution of singular values does look remarkably flat compared to my previous (failed) "reverse LQER" attempt.
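The overhead numbers are just the parameter count of the rank-r A/B pair relative to the full 7168×2048 matrix; for example:

```python
def lora_overhead(rank: int, d_out: int, d_in: int) -> float:
    """Extra parameters of a rank-r A/B pair relative to the full matrix."""
    return rank * (d_out + d_in) / (d_out * d_in)

for r in (64, 256):
    print(f"rank {r:4d}: {lora_overhead(r, 7168, 2048) * 100:.1f}% extra")
# rank   64: 4.0% extra
# rank  256: 16.1% extra
```

Note this compares parameter counts; if the A/B factors are stored in F16 while the base weights are ~4-bit, the relative size overhead is correspondingly larger.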
-
It's a little better on the early layers:

But the "reverse LQER" attempt was getting 50%+ Variance Explained for these...
-
Slightly better than "reverse LQER", but still not worth bothering with IMO - just adding an extra bit to the quant would give far more improvement than LQER with all the extra overhead it adds... The only interesting thing is it does show that the early layers (of the MoE tensors of

Overall LQER seems a waste of time (can't comment on L2QER though).
-
Since the recent LoRA refactor by @ngxson in #8332, I think it should be possible to improve existing quantization schemes with Low-Rank Quantization Error Reconstruction (see https://arxiv.org/abs/2402.02446).
It would only need two things:
gguf-py/gguf/quants.py to make this easier.
And also I think L²QER could be implemented with the existing imatrix files.
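The L²QER variant scales the quantization error by a diagonal built from activation statistics before the SVD, which is essentially the per-channel information the imatrix already collects. A rough sketch of the idea, assuming `imatrix_sq_acts` holds the mean squared activation per input channel (the exact scaling used in the paper may differ):

```python
import torch

def l2qer_factors(W: torch.Tensor, W_deq: torch.Tensor,
                  imatrix_sq_acts: torch.Tensor, rank: int):
    """
    Activation-weighted low-rank approximation of the quantization error.

    W               : original weight, [out, in]
    W_deq           : dequantized (quantized) weight, [out, in]
    imatrix_sq_acts : mean squared activation per input channel, [in]
    """
    E = (W - W_deq).float()                      # quantization error
    s = imatrix_sq_acts.clamp_min(1e-8).sqrt()   # per-input-channel scale

    # Weight the error columns by the activation scale, SVD, then undo the scale
    U, S, Vh = torch.linalg.svd(E * s, full_matrices=False)
    lora_B = U[:, :rank] * S[:rank]              # [out, rank]
    lora_A = Vh[:rank] / s                       # [rank, in]
    return lora_A, lora_B                        # lora_B @ lora_A ~= E
```

The plain LQER case is the same sketch with `s` set to all ones.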