about precision loss #52

Closed · sinoaidi opened this issue Sep 26, 2024 · 3 comments
Labels
question Further information is requested

Comments

@sinoaidi

Compared with llama.cpp, does T-MAC lose precision when running quantized models, or does it give the same results? I am running Qwen1.5 4-bit (https://huggingface.co/Qwen/Qwen1.5-4B-Chat-GPTQ-Int4) now, and I found that the answers given by the model are sometimes wrong, especially in English, like this:
[Screenshot attached: 微信截图_20240926091000]

@BarfingLemurs

Yes, but their GPTQ model uses a group size of 128:
https://huggingface.co/Qwen/Qwen1.5-4B-Chat-GPTQ-Int4/blob/ff03f8a9647d68587c4bc621eeafd61c9df4487b/config.json#L29

The understanding is that a group size of 32 would be better.

From the source: ggml-org/llama.cpp#1684

In the existing ggml quantization types we have "type-0" (Q4_0, Q5_0) and "type-1" (Q4_1, Q5_1). In "type-0", weights w are obtained from quants q using w = d * q, where d is the block scale. In "type-1", weights are given by w = d * q + m, where m is the block minimum.

• Q4_K_M carries a mixture of these quantization types:

  • GGML_TYPE_Q4_K - "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw.
  • GGML_TYPE_Q5_K - "type-1" 5-bit quantization. Same super-block structure as GGML_TYPE_Q4_K, resulting in 5.5 bpw.
  • GGML_TYPE_Q6_K - "type-0" 6-bit quantization. Super-blocks with 16 blocks, each block having 16 weights. Scales are quantized with 8 bits. This ends up using 6.5625 bpw.
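
A minimal sketch of the two dequantization formulas quoted above, assuming plain NumPy round-to-nearest on a single 32-weight block (this is only an illustration, not llama.cpp's actual packed block layouts or kernels):

```python
import numpy as np

# One toy block of 32 float weights.
rng = np.random.default_rng(0)
w = rng.normal(size=32).astype(np.float32)

# "type-0": w = d * q, symmetric signed quants in [-8, 7], only a block scale d.
d0 = np.abs(w).max() / 7.0
q0 = np.clip(np.round(w / d0), -8, 7)
w_type0 = d0 * q0

# "type-1": w = d * q + m, unsigned quants in [0, 15], block scale d and block minimum m.
m = w.min()
d1 = (w.max() - m) / 15.0
q1 = np.clip(np.round((w - m) / d1), 0, 15)
w_type1 = d1 * q1 + m

print("type-0 mean abs error:", np.abs(w - w_type0).mean())
print("type-1 mean abs error:", np.abs(w - w_type1).mean())
```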

kaleid-liner added the question (Further information is requested) label on Sep 26, 2024
@kaleid-liner
Collaborator

There are several ways to improve GPTQ model quality, including using w4g64 instead of w4g128, or doing QAT such as EfficientQAT. Another cause is the quality of prompt engineering: some models can output random results without the correct prompt.
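
A minimal sketch of why the smaller group size (w4g64 vs w4g128) helps, assuming plain round-to-nearest symmetric quantization in NumPy on a random toy matrix; this is only an illustration of the group-size effect, not the actual GPTQ algorithm:

```python
import numpy as np

def fake_quant(w, bits=4, group_size=128):
    """Quantize-dequantize with one scale per group (round-to-nearest, symmetric)."""
    qmax = 2 ** (bits - 1) - 1                       # 7 for 4-bit signed
    groups = w.reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(groups / scale), -qmax - 1, qmax)
    return (q * scale).reshape(w.shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(4096, 128)).astype(np.float32)  # toy weight matrix

for g in (32, 64, 128):                              # w4g32 / w4g64 / w4g128
    err = np.abs(w - fake_quant(w, group_size=g)).mean()
    print(f"group size {g:4d}: mean abs error {err:.5f}")
```

With this toy setup the mean error shrinks as the group size shrinks, which is the same direction the w4g64-over-w4g128 suggestion points in.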

From our experience, Qwen2 GPTQ w4g128 already performs well enough. However, we are still working on merging the latest llama.cpp to support Qwen2. Track the progress through #46.

@sinoaidi
Author

@kaleid-liner @BarfingLemurs thank you very much.
