About precision loss #52
Comments
Yes, but their GPTQ uses a group size of 128, while the understanding is that a group size of 32 would be better. From the source: ggml-org/llama.cpp#1684
GGML_TYPE_Q4_K - "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw.
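For reference, the 4.5 bpw figure follows from the block layout quoted above. A quick back-of-the-envelope check (the two fp16 super-block scales, d and dmin, are my assumption based on llama.cpp's block_q4_K struct; the quote itself doesn't spell them out):

```python
# Rough arithmetic behind the "4.5 bpw" figure for GGML_TYPE_Q4_K,
# assuming one 256-weight super-block = 8 blocks of 32 weights.
quant_bits  = 256 * 4       # 4-bit quantized weights
scale_bits  = 8 * (6 + 6)   # 6-bit scale + 6-bit min per 32-weight block
super_bits  = 2 * 16        # fp16 d and dmin for the super-block (assumed layout)
bpw = (quant_bits + scale_bits + super_bits) / 256
print(bpw)                  # -> 4.5
```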
There are several ways to improve the model quality of GPTQ, including using w4g64 instead of w4g128, or applying QAT such as EfficientQAT. Another cause is the quality of prompt engineering: some models, without the correct prompt, can output random results. In our experience, qwen2 GPTQ w4g128 already performs well enough. However, we are still working on merging the latest llama.cpp to support qwen2. Track the progress through #46.
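To illustrate why a smaller group size (w4g64 or g32 instead of g128) tends to help, here is a minimal sketch. This is plain round-to-nearest group-wise quantization, not the actual GPTQ algorithm (which additionally does Hessian-based error compensation); it only isolates the effect of group size: each group gets its own scale, so an outlier inflates the scale of fewer weights.

```python
# Minimal sketch: group-wise asymmetric 4-bit round-to-nearest quantization.
# Smaller groups localize the damage caused by outlier weights.
import numpy as np

def quantize_groupwise(w: np.ndarray, group_size: int, bits: int = 4) -> np.ndarray:
    """Quantize with one scale/zero-point per group, then dequantize."""
    qmax = 2 ** bits - 1
    g = w.reshape(-1, group_size)
    wmin = g.min(axis=1, keepdims=True)
    wmax = g.max(axis=1, keepdims=True)
    scale = np.maximum(wmax - wmin, 1e-8) / qmax
    q = np.clip(np.round((g - wmin) / scale), 0, qmax)
    return (q * scale + wmin).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(size=4096).astype(np.float32)
w[::512] *= 8.0  # inject a few outlier weights

for group_size in (128, 64, 32):
    err = np.abs(quantize_groupwise(w, group_size) - w).mean()
    print(f"group size {group_size:3d}: mean abs error {err:.5f}")
```

Running this, the mean reconstruction error drops as the group size shrinks, at the cost of storing more scales/mins (i.e., slightly higher bits per weight).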
@kaleid-liner @BarfingLemurs thank you very much.
Compared with llama.cpp, does T-MAC lose precision when running quantized models, or does it give the same results? I am running Qwen1.5 4-bit (https://huggingface.co/Qwen/Qwen1.5-4B-Chat-GPTQ-Int4) now, and I found that the answers given by the model are sometimes wrong, especially in English, like this:
