
iq3_xxs: guards for the no-imatrix situation #5334

Merged (1 commit) Feb 5, 2024

Conversation

ikawrakow
Contributor

IQ3_XXS can give a very bad quantization when used without an importance matrix (imatrix), see #5332.

Instead of adding a warning or even disallowing IQ3_XXS quantization without an imatrix, this PR prevents a bad outcome by using Q3_K for the attn_v tensors, and a mix of Q4_K and Q3_K for the ffn_down tensors when no imatrix has been supplied. This results in a somewhat larger quantized model (e.g., 2.61 GiB vs 2.5 GiB for 7B LLaMAs) but a more reasonable PPL (e.g., 5.4923 for LLaMA-v2-7B and a context of 4096 vs 100+).

@ggerganov ggerganov changed the title iq3_xxs: quards for the no-imatrix situation iq3_xxs: guards for the no-imatrix situation Feb 5, 2024
@ikawrakow ikawrakow merged commit 89503dc into master Feb 5, 2024
@ikawrakow ikawrakow deleted the ik/iq3xxs_noimatrix_guard branch February 5, 2024 10:32
jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Mar 13, 2024
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
3 participants