
Use Q4_K for attn_v for Q2_K_S when n_gqa >= 4 #4996

Merged · 1 commit · Jan 17, 2024
Conversation

ikawrakow (Contributor)

I missed this tweak when adding Q2_K_S.

With this change, model size for Mistral-7B increases by only ~30 MB (0.03 bpw) while

  • Perplexity for a context of 512 on wiki.test.raw goes down from 6.9259 to 6.7116
  • 10-shot HellaSwag score after 2000 tasks increases by 0.95 +/- 0.42.

@ggerganov ggerganov merged commit 2b3a665 into master Jan 17, 2024
41 of 46 checks passed
jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Feb 3, 2024
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>