
gguf-py : simplify support for quant types #8838

Merged: 5 commits into master on Aug 8, 2024

Conversation

compilade (Collaborator):

Only two types are currently supported by gguf-py/gguf/quants.py (BF16 and Q8_0), but more will be added over time, especially for dequantization, since that could enable interesting things as in #8831.

Here, I'm reducing the amount of quant-type-specific code that needs to be written by using an abstract base class.

  • The quant-type-specific code now only needs to handle quantizing and dequantizing groups of blocks, without worrying about the actual shape or original type of the quantized tensor.
  • There is no need to manually adapt the quantization to lazy tensors; that's also handled by the base class. (A sketch of the pattern follows.)
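
To make this concrete, here is a minimal sketch of the pattern, with illustrative names and only the quantize direction shown (dequantization mirrors it, and the lazy-tensor handling is omitted); this is not the PR's exact code:

from abc import ABC, abstractmethod

import numpy as np


class __Quant(ABC):
    # illustrative: the real class derives these from the quant type
    block_size: int  # elements per block
    type_size: int   # bytes per packed block

    @classmethod
    @abstractmethod
    def quantize_blocks(cls, blocks: np.ndarray) -> np.ndarray:
        """(n_blocks, block_size) float32 -> (n_blocks, type_size) uint8"""

    @classmethod
    def quantize(cls, data: np.ndarray) -> np.ndarray:
        # the base class handles shape logic once, for every quant type
        shape = data.shape
        blocks = data.astype(np.float32).reshape(-1, cls.block_size)
        packed = cls.quantize_blocks(blocks)
        return packed.reshape(*shape[:-1], shape[-1] // cls.block_size * cls.type_size)


class Q8_0(__Quant):
    block_size, type_size = 32, 34  # f16 scale (2 bytes) + 32 int8 values

    @classmethod
    def quantize_blocks(cls, blocks: np.ndarray) -> np.ndarray:
        d = np.abs(blocks).max(axis=-1, keepdims=True) / 127
        with np.errstate(divide="ignore", invalid="ignore"):
            qs = np.round(blocks / d)
        qs[(d == 0).squeeze(-1)] = 0  # all-zero blocks get a zero scale
        return np.concatenate(
            [d.astype(np.float16).view(np.uint8), qs.astype(np.int8).view(np.uint8)],
            axis=-1,
        )

With the shape handling centralized like this, adding a new type means writing only the per-block conversion methods.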

I've also simplified the type selection logic in convert_hf_to_gguf.py, which should make overrides like the one in #8715 simpler to implement and more maintainable.
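
As a rough illustration of what simpler per-tensor selection can look like (the helper name is hypothetical, not the actual script code; GGML_QUANT_SIZES is gguf-py's table of (block_size, type_size) per type):

from gguf import GGMLQuantizationType, GGML_QUANT_SIZES


def pick_tensor_type(default_qtype: GGMLQuantizationType, shape: tuple[int, ...]) -> GGMLQuantizationType:
    block_size, _ = GGML_QUANT_SIZES[default_qtype]
    # rows that aren't a multiple of the block size can't use a blocked type
    if shape[-1] % block_size != 0:
        return GGMLQuantizationType.F16
    return default_qtype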


Using https://huggingface.co/Qwen/Qwen2-0.5B-Instruct, I've verified that conversion with convert_hf_to_gguf.py still produces the same files as running llama-quantize on an F32 conversion.

The files ending with -q were made with llama-quantize.

$ sha256sum Qwen2-0.5B-Instruct-*
e7b4db70a75dde91b59b5364e4ad90ebfee619b1eaaa9873d860b17c5fb05fdd  Qwen2-0.5B-Instruct-BF16.gguf
e7b4db70a75dde91b59b5364e4ad90ebfee619b1eaaa9873d860b17c5fb05fdd  Qwen2-0.5B-Instruct-BF16-q.gguf
202b115d3fd225217917e75c443d96aa872ebc35fdc6071e121967522088dc78  Qwen2-0.5B-Instruct-F16.gguf
202b115d3fd225217917e75c443d96aa872ebc35fdc6071e121967522088dc78  Qwen2-0.5B-Instruct-F16-q.gguf
e42be5f80c75c3d228b58c8dd523c9f6f04c37f070400da0380ae09df6bc6222  Qwen2-0.5B-Instruct-F32.gguf
f627c98832103ba056007ca8d8f12836986cc38a81f21458ee1cec1a6966292c  Qwen2-0.5B-Instruct-Q8_0.gguf
f627c98832103ba056007ca8d8f12836986cc38a81f21458ee1cec1a6966292c  Qwen2-0.5B-Instruct-Q8_0-q.gguf

I've also tested https://huggingface.co/state-spaces/mamba-130m-hf:

$ sha256sum mamba-130M-hf-*
348f8909e5c439e89666e9d40eba0376a64094b47c85a768d73d4083cfe2c650  mamba-130M-hf-BF16.gguf
348f8909e5c439e89666e9d40eba0376a64094b47c85a768d73d4083cfe2c650  mamba-130M-hf-BF16-q.gguf
006f0e2e16cdb009e60a4b1a187b127f7e32df93dfc1b989f230c95dc3408b5a  mamba-130M-hf-F16.gguf
006f0e2e16cdb009e60a4b1a187b127f7e32df93dfc1b989f230c95dc3408b5a  mamba-130M-hf-F16-q.gguf
f83181b53c678507de08d98a0d72e487ad794099dcd895e663f935f0281c258b  mamba-130M-hf-F32.gguf
d0697bf1167845c67635939b9de5a1e8b64f21e67acaf4b858ed6e72f850e590  mamba-130M-hf-Q8_0.gguf
d0697bf1167845c67635939b9de5a1e8b64f21e67acaf4b858ed6e72f850e590  mamba-130M-hf-Q8_0-q.gguf
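
The comparison above boils down to hashing each pair of files; a minimal sketch in Python, using paths from the listing above:

import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


# the direct conversion and the llama-quantize output should be bit-identical
assert sha256_of(Path("mamba-130M-hf-Q8_0.gguf")) == sha256_of(Path("mamba-130M-hf-Q8_0-q.gguf"))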

@compilade added labels on Aug 2, 2024: refactoring, Review Complexity: Low (trivial changes to code that most beginner devs, or those who want a break, can tackle, e.g. a UI fix), python (python script changes)
Comment on lines 1199 to 1237
# Default quantization type for each file type
# Keep this the same as in llama_model_quantize_internal from llama.cpp
LlamaFileTypeMap: dict[LlamaFileType, GGMLQuantizationType] = {
    LlamaFileType.MOSTLY_Q4_0: GGMLQuantizationType.Q4_0,
    LlamaFileType.MOSTLY_Q4_1: GGMLQuantizationType.Q4_1,
    LlamaFileType.MOSTLY_Q5_0: GGMLQuantizationType.Q5_0,
    LlamaFileType.MOSTLY_Q5_1: GGMLQuantizationType.Q5_1,
    LlamaFileType.MOSTLY_Q8_0: GGMLQuantizationType.Q8_0,
    LlamaFileType.MOSTLY_F16: GGMLQuantizationType.F16,
    LlamaFileType.MOSTLY_BF16: GGMLQuantizationType.BF16,
    LlamaFileType.ALL_F32: GGMLQuantizationType.F32,

    # K-quants
    LlamaFileType.MOSTLY_Q2_K_S: GGMLQuantizationType.Q2_K,
    LlamaFileType.MOSTLY_Q2_K: GGMLQuantizationType.Q2_K,
    LlamaFileType.MOSTLY_IQ3_XS: GGMLQuantizationType.IQ3_S,
    LlamaFileType.MOSTLY_Q3_K_S: GGMLQuantizationType.Q3_K,
    LlamaFileType.MOSTLY_Q3_K_M: GGMLQuantizationType.Q3_K,
    LlamaFileType.MOSTLY_Q3_K_L: GGMLQuantizationType.Q3_K,
    LlamaFileType.MOSTLY_Q4_K_S: GGMLQuantizationType.Q4_K,
    LlamaFileType.MOSTLY_Q4_K_M: GGMLQuantizationType.Q4_K,
    LlamaFileType.MOSTLY_Q5_K_S: GGMLQuantizationType.Q5_K,
    LlamaFileType.MOSTLY_Q5_K_M: GGMLQuantizationType.Q5_K,
    LlamaFileType.MOSTLY_Q6_K: GGMLQuantizationType.Q6_K,
    LlamaFileType.MOSTLY_IQ2_XXS: GGMLQuantizationType.IQ2_XXS,
    LlamaFileType.MOSTLY_IQ2_XS: GGMLQuantizationType.IQ2_XS,
    LlamaFileType.MOSTLY_IQ2_S: GGMLQuantizationType.IQ2_XS,
    LlamaFileType.MOSTLY_IQ2_M: GGMLQuantizationType.IQ2_S,
    LlamaFileType.MOSTLY_IQ3_XXS: GGMLQuantizationType.IQ3_XXS,
    LlamaFileType.MOSTLY_IQ1_S: GGMLQuantizationType.IQ1_S,
    LlamaFileType.MOSTLY_IQ1_M: GGMLQuantizationType.IQ1_M,
    LlamaFileType.MOSTLY_IQ4_NL: GGMLQuantizationType.IQ4_NL,
    LlamaFileType.MOSTLY_IQ4_XS: GGMLQuantizationType.IQ4_XS,
    LlamaFileType.MOSTLY_IQ3_S: GGMLQuantizationType.IQ3_S,
    LlamaFileType.MOSTLY_IQ3_M: GGMLQuantizationType.IQ3_S,
    LlamaFileType.MOSTLY_Q4_0_4_4: GGMLQuantizationType.Q4_0_4_4,
    LlamaFileType.MOSTLY_Q4_0_4_8: GGMLQuantizationType.Q4_0_4_8,
    LlamaFileType.MOSTLY_Q4_0_8_8: GGMLQuantizationType.Q4_0_8_8,
}
compilade (Collaborator, Author):

I'm adding this because it's now used in convert_hf_to_gguf.py to get the default quantization type from a file type, but I'm not sure whether the file types which are not used in the convert script should still be mapped.

Does anyone have an opinion on that?
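
(For context, the lookup this map enables is a one-liner; a minimal sketch, with a hypothetical helper name, using the names from the snippet under review:)

# hypothetical helper; LlamaFileType, GGMLQuantizationType and
# LlamaFileTypeMap are as defined in the snippet above
def default_quant_type(ftype: LlamaFileType) -> GGMLQuantizationType:
    qtype = LlamaFileTypeMap.get(ftype)
    if qtype is None:
        raise ValueError(f"no default quantization type for file type {ftype.name}")
    return qtype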

Member:

In general we should avoid coupling gguf with llama.cpp specifically. The llama_ftype enum is specific to llama.cpp, so maybe it would be better to avoid it.

Member:

Maybe at some point we should move the LlamaFileType enum from gguf-py/gguf/constants.py to a new llama.cpp/constants.py, which could hold the llama.cpp-specific file type logic and potentially other things.

@compilade merged commit 3a14e00 into master on Aug 8, 2024. 12 checks passed.
arthw pushed a commit to arthw/llama.cpp that referenced this pull request on Nov 15, 2024:
* gguf-py : use classes for quants

* convert_hf : simplify internal quantization type selection

* gguf-py : fix flake8 lint

* gguf-py : fix BF16 numpy view type

* gguf-py : remove LlamaFileTypeMap

Too specific to 'llama.cpp', and would be a maintenance burden
to keep up to date.

* gguf-py : add generic quantize and dequantize functions

The quant classes no longer need to be known,
only the target or the source type,
for 'quantize' and 'dequantize', respectively.
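
That last commit's generic functions dispatch on the quantization type alone; usage looks roughly like this (a sketch assuming gguf-py's gguf.quants module exposes quantize and dequantize as the commit message describes):

import numpy as np

import gguf
from gguf.quants import quantize, dequantize

data = np.random.rand(4, 64).astype(np.float32)

# callers name only the target/source type; no quant class is referenced
packed = quantize(data, gguf.GGMLQuantizationType.Q8_0)        # packed uint8 blocks
restored = dequantize(packed, gguf.GGMLQuantizationType.Q8_0)  # back to float32

assert restored.shape == data.shape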
arthw pushed a commit to arthw/llama.cpp that referenced this pull request on Nov 18, 2024 (same commit message as above).
Labels: python (python script changes), refactoring, Review Complexity: Low (trivial changes to code that most beginner devs, or those who want a break, can tackle, e.g. a UI fix)