gguf-py : simplify support for quant types #8838
Conversation
gguf-py/gguf/constants.py (Outdated)
```python
# Default quantization type for each file type
# Keep this the same as in llama_model_quantize_internal from llama.cpp
LlamaFileTypeMap: dict[LlamaFileType, GGMLQuantizationType] = {
    LlamaFileType.MOSTLY_Q4_0: GGMLQuantizationType.Q4_0,
    LlamaFileType.MOSTLY_Q4_1: GGMLQuantizationType.Q4_1,
    LlamaFileType.MOSTLY_Q5_0: GGMLQuantizationType.Q5_0,
    LlamaFileType.MOSTLY_Q5_1: GGMLQuantizationType.Q5_1,
    LlamaFileType.MOSTLY_Q8_0: GGMLQuantizationType.Q8_0,
    LlamaFileType.MOSTLY_F16: GGMLQuantizationType.F16,
    LlamaFileType.MOSTLY_BF16: GGMLQuantizationType.BF16,
    LlamaFileType.ALL_F32: GGMLQuantizationType.F32,

    # K-quants
    LlamaFileType.MOSTLY_Q2_K_S: GGMLQuantizationType.Q2_K,
    LlamaFileType.MOSTLY_Q2_K: GGMLQuantizationType.Q2_K,
    LlamaFileType.MOSTLY_IQ3_XS: GGMLQuantizationType.IQ3_S,
    LlamaFileType.MOSTLY_Q3_K_S: GGMLQuantizationType.Q3_K,
    LlamaFileType.MOSTLY_Q3_K_M: GGMLQuantizationType.Q3_K,
    LlamaFileType.MOSTLY_Q3_K_L: GGMLQuantizationType.Q3_K,
    LlamaFileType.MOSTLY_Q4_K_S: GGMLQuantizationType.Q4_K,
    LlamaFileType.MOSTLY_Q4_K_M: GGMLQuantizationType.Q4_K,
    LlamaFileType.MOSTLY_Q5_K_S: GGMLQuantizationType.Q5_K,
    LlamaFileType.MOSTLY_Q5_K_M: GGMLQuantizationType.Q5_K,
    LlamaFileType.MOSTLY_Q6_K: GGMLQuantizationType.Q6_K,
    LlamaFileType.MOSTLY_IQ2_XXS: GGMLQuantizationType.IQ2_XXS,
    LlamaFileType.MOSTLY_IQ2_XS: GGMLQuantizationType.IQ2_XS,
    LlamaFileType.MOSTLY_IQ2_S: GGMLQuantizationType.IQ2_XS,
    LlamaFileType.MOSTLY_IQ2_M: GGMLQuantizationType.IQ2_S,
    LlamaFileType.MOSTLY_IQ3_XXS: GGMLQuantizationType.IQ3_XXS,
    LlamaFileType.MOSTLY_IQ1_S: GGMLQuantizationType.IQ1_S,
    LlamaFileType.MOSTLY_IQ1_M: GGMLQuantizationType.IQ1_M,
    LlamaFileType.MOSTLY_IQ4_NL: GGMLQuantizationType.IQ4_NL,
    LlamaFileType.MOSTLY_IQ4_XS: GGMLQuantizationType.IQ4_XS,
    LlamaFileType.MOSTLY_IQ3_S: GGMLQuantizationType.IQ3_S,
    LlamaFileType.MOSTLY_IQ3_M: GGMLQuantizationType.IQ3_S,
    LlamaFileType.MOSTLY_Q4_0_4_4: GGMLQuantizationType.Q4_0_4_4,
    LlamaFileType.MOSTLY_Q4_0_4_8: GGMLQuantizationType.Q4_0_4_8,
    LlamaFileType.MOSTLY_Q4_0_8_8: GGMLQuantizationType.Q4_0_8_8,
}
```
I'm adding this because it's now used in `convert_hf_to_gguf.py` to get the default quantization type from a file type, but I'm not sure whether the file types that are not used in the convert script should still be mapped. Does anyone have an opinion on that?
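For illustration, a minimal sketch of the lookup this map is meant to enable, assuming the diff above is applied (the helper name `default_tensor_type` is hypothetical):

```python
from gguf.constants import GGMLQuantizationType, LlamaFileType, LlamaFileTypeMap

def default_tensor_type(ftype: LlamaFileType) -> GGMLQuantizationType:
    # Hypothetical helper: resolve the default per-tensor quantization type
    # for a given file type, mirroring llama_model_quantize_internal.
    return LlamaFileTypeMap[ftype]

# e.g. MOSTLY_Q4_K_M defaults to Q4_K for most tensors
assert default_tensor_type(LlamaFileType.MOSTLY_Q4_K_M) == GGMLQuantizationType.Q4_K
```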
In general we should avoid coupling `gguf` with `llama.cpp` specifically. The `llama_ftype` enum is specific to `llama.cpp`, so maybe it would be better to avoid it.
Maybe at some point we should move the `LlamaFileType` enum from `gguf-py/gguf/constants.py` to a new `llama.cpp/constants.py`, and this file can hold the `llama.cpp`-specific file type logic and potentially other stuff.
Commits:

* gguf-py : use classes for quants
* convert_hf : simplify internal quantization type selection
* gguf-py : fix flake8 lint
* gguf-py : fix BF16 numpy view type
* gguf-py : remove LlamaFileTypeMap
  Too specific to 'llama.cpp', and would be a maintenance burden to keep up to date.
* gguf-py : add generic quantize and dequantize functions
  The quant classes no longer need to be known, only the target or the source type, for 'quantize' and 'dequantize', respectively.
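The generic entry points described in the last commit might be used roughly like this; the module path `gguf.quants` and the exact `quantize`/`dequantize` signatures are inferred from the commit message, so treat this as a sketch rather than the definitive API:

```python
import numpy as np
from gguf.constants import GGMLQuantizationType
from gguf import quants

# Float32 data whose rows are a multiple of the Q8_0 block size (32).
data = np.random.rand(4, 64).astype(np.float32)

# Only the target type is named; no quant class is referenced directly.
packed = quants.quantize(data, GGMLQuantizationType.Q8_0)

# Likewise, only the source type is needed to go back to float32.
restored = quants.dequantize(packed, GGMLQuantizationType.Q8_0)
```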
There are only 2 types right now supported by `gguf-py/gguf/quants.py` (`BF16` and `Q8_0`), but there will be more over time, especially for dequantization, because this could enable interesting things as in #8831. Here, I'm reducing the quant-type-specific code needed to be written by using an abstract base class.
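As a rough illustration of the abstract-base-class approach (a simplified sketch, not the actual `gguf-py` classes; names like `QuantBase` and the registry mechanism are illustrative), each quant type only has to implement its block-level (de)quantization, and dispatch happens by `GGMLQuantizationType` alone:

```python
from __future__ import annotations
from abc import ABC, abstractmethod
import numpy as np
from gguf.constants import GGMLQuantizationType

class QuantBase(ABC):
    # Maps each quant type to the class implementing it; filled in
    # automatically whenever a subclass is declared.
    registry: dict[GGMLQuantizationType, type[QuantBase]] = {}
    qtype: GGMLQuantizationType

    def __init_subclass__(cls, qtype: GGMLQuantizationType, **kwargs):
        super().__init_subclass__(**kwargs)
        cls.qtype = qtype
        QuantBase.registry[qtype] = cls

    @classmethod
    @abstractmethod
    def quantize_blocks(cls, blocks: np.ndarray) -> np.ndarray: ...

    @classmethod
    @abstractmethod
    def dequantize_blocks(cls, blocks: np.ndarray) -> np.ndarray: ...


class Q8_0(QuantBase, qtype=GGMLQuantizationType.Q8_0):
    # blocks: float32 array of shape (n_blocks, 32)
    @classmethod
    def quantize_blocks(cls, blocks: np.ndarray) -> np.ndarray:
        d = np.abs(blocks).max(axis=-1, keepdims=True) / 127
        qs = np.round(blocks / np.where(d == 0, 1, d)).astype(np.int8)
        # Pack the float16 scale and the 32 int8 weights into raw bytes.
        return np.concatenate([d.astype(np.float16).view(np.uint8), qs.view(np.uint8)], axis=-1)

    @classmethod
    def dequantize_blocks(cls, blocks: np.ndarray) -> np.ndarray:
        d = blocks[..., :2].view(np.float16).astype(np.float32)
        qs = blocks[..., 2:].view(np.int8).astype(np.float32)
        return d * qs


# Dispatch by type only; the concrete class is found through the registry.
blocks = np.random.rand(4, 32).astype(np.float32)
packed = QuantBase.registry[GGMLQuantizationType.Q8_0].quantize_blocks(blocks)
```

A design along these lines is what lets generic `quantize`/`dequantize` stay type-driven: a new quant type is added by defining one subclass, without touching the call sites.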
I've also simplified the type selection logic in `convert_hf_to_gguf.py`, which should allow making overrides like in #8715 simpler to implement in a more maintainable way.
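One way the per-tensor selection can stay simple, sketched under assumptions (the helper and fallback rules below are illustrative, not the exact logic in `convert_hf_to_gguf.py`):

```python
import numpy as np
from gguf.constants import GGMLQuantizationType

def select_tensor_type(requested: GGMLQuantizationType, data: np.ndarray) -> GGMLQuantizationType:
    # 1D tensors (norms, biases) are usually kept in full precision.
    if data.ndim == 1:
        return GGMLQuantizationType.F32
    # Q8_0 packs rows in blocks of 32, so fall back when that doesn't divide evenly.
    if requested == GGMLQuantizationType.Q8_0 and data.shape[-1] % 32 != 0:
        return GGMLQuantizationType.F16
    return requested
```

An override mechanism like the one discussed in #8715 would then only need to change the `requested` type per tensor name, rather than duplicating the fallback logic.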
I've tested with https://huggingface.co/Qwen/Qwen2-0.5B-Instruct that conversion with `convert_hf_to_gguf.py` is still the same as when using `llama-quantize` from an F32 conversion. The files ending with `-q` were made with `llama-quantize`.

I've also tested https://huggingface.co/state-spaces/mamba-130m-hf: