
gguf-py : simplify support for quant types #8838

Merged: 5 commits into master on Aug 8, 2024

Conversation

compilade (Collaborator):

Only two types are currently supported by gguf-py/gguf/quants.py (BF16 and Q8_0), but more will be added over time, especially for dequantization, since that could enable interesting things as in #8831.

Here, I'm reducing the amount of quant-type-specific code that needs to be written by using an abstract base class.

  • The quant-type-specific code now only needs to handle quantizing and dequantizing groups of blocks, without worrying about the actual shape or original type of the quantized tensor.
  • There is no need to manually adapt the quantization to lazy tensors; that's also handled by the base class. (A sketch of the pattern follows.)
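
To make this concrete, here is a minimal sketch of the pattern, with illustrative names and only the quantize direction shown (dequantization mirrors it, and the lazy-tensor handling is omitted); this is not the PR's exact code:

from abc import ABC, abstractmethod

import numpy as np


class __Quant(ABC):
    # illustrative: the real class derives these from the quant type
    block_size: int  # elements per block
    type_size: int   # bytes per packed block

    @classmethod
    @abstractmethod
    def quantize_blocks(cls, blocks: np.ndarray) -> np.ndarray:
        """(n_blocks, block_size) float32 -> (n_blocks, type_size) uint8"""

    @classmethod
    def quantize(cls, data: np.ndarray) -> np.ndarray:
        # the base class handles shape logic once, for every quant type
        shape = data.shape
        blocks = data.astype(np.float32).reshape(-1, cls.block_size)
        packed = cls.quantize_blocks(blocks)
        return packed.reshape(*shape[:-1], shape[-1] // cls.block_size * cls.type_size)


class Q8_0(__Quant):
    block_size, type_size = 32, 34  # f16 scale (2 bytes) + 32 int8 values

    @classmethod
    def quantize_blocks(cls, blocks: np.ndarray) -> np.ndarray:
        d = np.abs(blocks).max(axis=-1, keepdims=True) / 127
        with np.errstate(divide="ignore", invalid="ignore"):
            qs = np.round(blocks / d)
        qs[(d == 0).squeeze(-1)] = 0  # all-zero blocks get a zero scale
        return np.concatenate(
            [d.astype(np.float16).view(np.uint8), qs.astype(np.int8).view(np.uint8)],
            axis=-1,
        )

With the shape handling centralized like this, adding a new type means writing only the per-block conversion methods.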

I've also simplified the type selection logic in convert_hf_to_gguf.py, which should make overrides like the one in #8715 simpler to implement and more maintainable.
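
As a rough illustration of what simpler per-tensor selection can look like (the helper name is hypothetical, not the actual script code; GGML_QUANT_SIZES is gguf-py's table of (block_size, type_size) per type):

from gguf import GGMLQuantizationType, GGML_QUANT_SIZES


def pick_tensor_type(default_qtype: GGMLQuantizationType, shape: tuple[int, ...]) -> GGMLQuantizationType:
    block_size, _ = GGML_QUANT_SIZES[default_qtype]
    # rows that aren't a multiple of the block size can't use a blocked type
    if shape[-1] % block_size != 0:
        return GGMLQuantizationType.F16
    return default_qtype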


Using https://huggingface.co/Qwen/Qwen2-0.5B-Instruct, I've verified that conversion with convert_hf_to_gguf.py still produces the same files as running llama-quantize on an F32 conversion.

The files ending with -q were made with llama-quantize.

$ sha256sum Qwen2-0.5B-Instruct-*
e7b4db70a75dde91b59b5364e4ad90ebfee619b1eaaa9873d860b17c5fb05fdd  Qwen2-0.5B-Instruct-BF16.gguf
e7b4db70a75dde91b59b5364e4ad90ebfee619b1eaaa9873d860b17c5fb05fdd  Qwen2-0.5B-Instruct-BF16-q.gguf
202b115d3fd225217917e75c443d96aa872ebc35fdc6071e121967522088dc78  Qwen2-0.5B-Instruct-F16.gguf
202b115d3fd225217917e75c443d96aa872ebc35fdc6071e121967522088dc78  Qwen2-0.5B-Instruct-F16-q.gguf
e42be5f80c75c3d228b58c8dd523c9f6f04c37f070400da0380ae09df6bc6222  Qwen2-0.5B-Instruct-F32.gguf
f627c98832103ba056007ca8d8f12836986cc38a81f21458ee1cec1a6966292c  Qwen2-0.5B-Instruct-Q8_0.gguf
f627c98832103ba056007ca8d8f12836986cc38a81f21458ee1cec1a6966292c  Qwen2-0.5B-Instruct-Q8_0-q.gguf

I've also tested https://huggingface.co/state-spaces/mamba-130m-hf:

$ sha256sum mamba-130M-hf-*
348f8909e5c439e89666e9d40eba0376a64094b47c85a768d73d4083cfe2c650  mamba-130M-hf-BF16.gguf
348f8909e5c439e89666e9d40eba0376a64094b47c85a768d73d4083cfe2c650  mamba-130M-hf-BF16-q.gguf
006f0e2e16cdb009e60a4b1a187b127f7e32df93dfc1b989f230c95dc3408b5a  mamba-130M-hf-F16.gguf
006f0e2e16cdb009e60a4b1a187b127f7e32df93dfc1b989f230c95dc3408b5a  mamba-130M-hf-F16-q.gguf
f83181b53c678507de08d98a0d72e487ad794099dcd895e663f935f0281c258b  mamba-130M-hf-F32.gguf
d0697bf1167845c67635939b9de5a1e8b64f21e67acaf4b858ed6e72f850e590  mamba-130M-hf-Q8_0.gguf
d0697bf1167845c67635939b9de5a1e8b64f21e67acaf4b858ed6e72f850e590  mamba-130M-hf-Q8_0-q.gguf
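
The comparison above boils down to hashing each pair of files; a minimal sketch in Python, using paths from the listing above:

import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


# the direct conversion and the llama-quantize output should be bit-identical
assert sha256_of(Path("mamba-130M-hf-Q8_0.gguf")) == sha256_of(Path("mamba-130M-hf-Q8_0-q.gguf"))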

@compilade added labels on Aug 2, 2024: refactoring, Review Complexity: Low (trivial changes to code that most beginner devs, or those who want a break, can tackle, e.g. a UI fix), python (python script changes)
Comment on lines 1199 to 1237
# Default quantization type for each file type
# Keep this the same as in llama_model_quantize_internal from llama.cpp
LlamaFileTypeMap: dict[LlamaFileType, GGMLQuantizationType] = {
    LlamaFileType.MOSTLY_Q4_0: GGMLQuantizationType.Q4_0,
    LlamaFileType.MOSTLY_Q4_1: GGMLQuantizationType.Q4_1,
    LlamaFileType.MOSTLY_Q5_0: GGMLQuantizationType.Q5_0,
    LlamaFileType.MOSTLY_Q5_1: GGMLQuantizationType.Q5_1,
    LlamaFileType.MOSTLY_Q8_0: GGMLQuantizationType.Q8_0,
    LlamaFileType.MOSTLY_F16: GGMLQuantizationType.F16,
    LlamaFileType.MOSTLY_BF16: GGMLQuantizationType.BF16,
    LlamaFileType.ALL_F32: GGMLQuantizationType.F32,

    # K-quants
    LlamaFileType.MOSTLY_Q2_K_S: GGMLQuantizationType.Q2_K,
    LlamaFileType.MOSTLY_Q2_K: GGMLQuantizationType.Q2_K,
    LlamaFileType.MOSTLY_IQ3_XS: GGMLQuantizationType.IQ3_S,
    LlamaFileType.MOSTLY_Q3_K_S: GGMLQuantizationType.Q3_K,
    LlamaFileType.MOSTLY_Q3_K_M: GGMLQuantizationType.Q3_K,
    LlamaFileType.MOSTLY_Q3_K_L: GGMLQuantizationType.Q3_K,
    LlamaFileType.MOSTLY_Q4_K_S: GGMLQuantizationType.Q4_K,
    LlamaFileType.MOSTLY_Q4_K_M: GGMLQuantizationType.Q4_K,
    LlamaFileType.MOSTLY_Q5_K_S: GGMLQuantizationType.Q5_K,
    LlamaFileType.MOSTLY_Q5_K_M: GGMLQuantizationType.Q5_K,
    LlamaFileType.MOSTLY_Q6_K: GGMLQuantizationType.Q6_K,
    LlamaFileType.MOSTLY_IQ2_XXS: GGMLQuantizationType.IQ2_XXS,
    LlamaFileType.MOSTLY_IQ2_XS: GGMLQuantizationType.IQ2_XS,
    LlamaFileType.MOSTLY_IQ2_S: GGMLQuantizationType.IQ2_XS,
    LlamaFileType.MOSTLY_IQ2_M: GGMLQuantizationType.IQ2_S,
    LlamaFileType.MOSTLY_IQ3_XXS: GGMLQuantizationType.IQ3_XXS,
    LlamaFileType.MOSTLY_IQ1_S: GGMLQuantizationType.IQ1_S,
    LlamaFileType.MOSTLY_IQ1_M: GGMLQuantizationType.IQ1_M,
    LlamaFileType.MOSTLY_IQ4_NL: GGMLQuantizationType.IQ4_NL,
    LlamaFileType.MOSTLY_IQ4_XS: GGMLQuantizationType.IQ4_XS,
    LlamaFileType.MOSTLY_IQ3_S: GGMLQuantizationType.IQ3_S,
    LlamaFileType.MOSTLY_IQ3_M: GGMLQuantizationType.IQ3_S,
    LlamaFileType.MOSTLY_Q4_0_4_4: GGMLQuantizationType.Q4_0_4_4,
    LlamaFileType.MOSTLY_Q4_0_4_8: GGMLQuantizationType.Q4_0_4_8,
    LlamaFileType.MOSTLY_Q4_0_8_8: GGMLQuantizationType.Q4_0_8_8,
}
compilade (Collaborator, Author):

I'm adding this because it's now used in convert_hf_to_gguf.py to get the default quantization type from a file type, but I'm not sure whether the file types which are not used in the convert script should still be mapped.

Does anyone have an opinion on that?
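
(For context, the lookup this map enables is a one-liner; a minimal sketch, with a hypothetical helper name, using the names from the snippet under review:)

# hypothetical helper; LlamaFileType, GGMLQuantizationType and
# LlamaFileTypeMap are as defined in the snippet above
def default_quant_type(ftype: LlamaFileType) -> GGMLQuantizationType:
    qtype = LlamaFileTypeMap.get(ftype)
    if qtype is None:
        raise ValueError(f"no default quantization type for file type {ftype.name}")
    return qtype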

Member:

In general we should avoid coupling gguf with llama.cpp specifically. The llama_ftype enum is specific to llama.cpp, so maybe it would be better to avoid it.

Member:

Maybe at some point we should move the LlamaFileType enum from gguf-py/gguf/constants.py to a new llama.cpp/constants.py, which could hold the llama.cpp-specific file type logic and potentially other things.

@compilade merged commit 3a14e00 into master on Aug 8, 2024. 12 checks passed.
arthw pushed a commit to arthw/llama.cpp that referenced this pull request on Nov 15, 2024:
* gguf-py : use classes for quants

* convert_hf : simplify internal quantization type selection

* gguf-py : fix flake8 lint

* gguf-py : fix BF16 numpy view type

* gguf-py : remove LlamaFileTypeMap

Too specific to 'llama.cpp', and would be a maintenance burden
to keep up to date.

* gguf-py : add generic quantize and dequantize functions

The quant classes no longer need to be known,
only the target or the source type,
for 'quantize' and 'dequantize', respectively.
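
That last commit's generic functions dispatch on the quantization type alone; usage looks roughly like this (a sketch assuming gguf-py's gguf.quants module exposes quantize and dequantize as the commit message describes):

import numpy as np

import gguf
from gguf.quants import quantize, dequantize

data = np.random.rand(4, 64).astype(np.float32)

# callers name only the target/source type; no quant class is referenced
packed = quantize(data, gguf.GGMLQuantizationType.Q8_0)        # packed uint8 blocks
restored = dequantize(packed, gguf.GGMLQuantizationType.Q8_0)  # back to float32

assert restored.shape == data.shape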
arthw pushed a commit to arthw/llama.cpp that referenced this pull request on Nov 18, 2024 (same commit message as above).
Labels: python (python script changes), refactoring, Review Complexity: Low (trivial changes to code that most beginner devs, or those who want a break, can tackle, e.g. a UI fix)