ggml-cpu: replace AArch64 NEON assembly with intrinsics in ggml_gemv_q4_0_4x4_q8_0() #10567
Conversation
Force-pushed from 69dd941 to bde1b96.
Using intrinsics is definitely preferred.
I also don't observe a significant performance change. Here are the results on M2 Ultra, using the following patch to force the Q4_0_4_4 repack:
diff --git a/ggml/src/ggml-cpu/ggml-cpu-aarch64.c b/ggml/src/ggml-cpu/ggml-cpu-aarch64.c
index 6d2c0adc3..cac45278b 100644
--- a/ggml/src/ggml-cpu/ggml-cpu-aarch64.c
+++ b/ggml/src/ggml-cpu/ggml-cpu-aarch64.c
@@ -3812,7 +3812,7 @@ enum ggml_type ggml_aarch64_get_optimal_repack_type(const struct ggml_tensor * c
return GGML_TYPE_Q4_0_8_8;
}
if (ggml_cpu_has_neon() && ggml_cpu_has_matmul_int8()) {
- return GGML_TYPE_Q4_0_4_8;
+ //return GGML_TYPE_Q4_0_4_8;
}
if (ggml_cpu_has_neon() && ggml_cpu_has_dotprod()) {
return GGML_TYPE_Q4_0_4_4;
make -j && ./bin/llama-bench -m ../models/qwen2.5-1.5b-coder/ggml-model-q4_0.gguf -t 1,2,4,8,16 -p 0 -n 64 -fa 1
master
model | size | params | backend | threads | fa | test | t/s |
---|---|---|---|---|---|---|---|
qwen2 1.5B Q4_0 | 885.97 MiB | 1.54 B | CPU | 1 | 1 | tg64 | 37.44 ± 0.19 |
qwen2 1.5B Q4_0 | 885.97 MiB | 1.54 B | CPU | 2 | 1 | tg64 | 65.51 ± 0.19 |
qwen2 1.5B Q4_0 | 885.97 MiB | 1.54 B | CPU | 4 | 1 | tg64 | 117.41 ± 0.35 |
qwen2 1.5B Q4_0 | 885.97 MiB | 1.54 B | CPU | 8 | 1 | tg64 | 158.53 ± 0.30 |
qwen2 1.5B Q4_0 | 885.97 MiB | 1.54 B | CPU | 16 | 1 | tg64 | 159.32 ± 1.79 |
build: 7281cf1 (4211)
PR
model | size | params | backend | threads | fa | test | t/s |
---|---|---|---|---|---|---|---|
qwen2 1.5B Q4_0 | 885.97 MiB | 1.54 B | CPU | 1 | 1 | tg64 | 36.92 ± 0.12 |
qwen2 1.5B Q4_0 | 885.97 MiB | 1.54 B | CPU | 2 | 1 | tg64 | 66.53 ± 0.55 |
qwen2 1.5B Q4_0 | 885.97 MiB | 1.54 B | CPU | 4 | 1 | tg64 | 114.92 ± 1.64 |
qwen2 1.5B Q4_0 | 885.97 MiB | 1.54 B | CPU | 8 | 1 | tg64 | 158.25 ± 0.17 |
qwen2 1.5B Q4_0 | 885.97 MiB | 1.54 B | CPU | 16 | 1 | tg64 | 157.57 ± 4.43 |
build: bde1b96 (4210)
Force-pushed from bde1b96 to 7091a85.
Also, I replicated the current C/asm code, but I would have done something more like this:

- ret = vdotq_laneq_s32(ret, b0 << 4, a0, 0);
- ret = vdotq_laneq_s32(ret, b1 << 4, a0, 1);
- ret = vdotq_laneq_s32(ret, b2 << 4, a0, 2);
- ret = vdotq_laneq_s32(ret, b3 << 4, a0, 3);
+ ret = vdotq_laneq_s32(ret, b0 >> 4, a0, 0);
+ ret = vdotq_laneq_s32(ret, b1 >> 4, a0, 1);
+ ret = vdotq_laneq_s32(ret, b2 >> 4, a0, 2);
+ ret = vdotq_laneq_s32(ret, b3 >> 4, a0, 3);
- ret = vdotq_laneq_s32(ret, b0 & 0xf0U, a1, 0);
- ret = vdotq_laneq_s32(ret, b1 & 0xf0U, a1, 1);
- ret = vdotq_laneq_s32(ret, b2 & 0xf0U, a1, 2);
- ret = vdotq_laneq_s32(ret, b3 & 0xf0U, a1, 3);
+ ret = vdotq_laneq_s32(ret, b0 & 0xfU, a1, 0);
+ ret = vdotq_laneq_s32(ret, b1 & 0xfU, a1, 1);
+ ret = vdotq_laneq_s32(ret, b2 & 0xfU, a1, 2);
+ ret = vdotq_laneq_s32(ret, b3 & 0xfU, a1, 3);
- acc = vfmaq_f32(acc, vcvtq_n_f32_s32(ret, 4),
+ acc = vfmaq_f32(acc, vcvtq_f32_s32(ret),

If anyone has an explanation for why it was done this way, I'm interested.
+ acc = vfmaq_f32(acc, vcvtq_f32_s32(ret), If anyone has an explanation for why it was done this way, I'm interested. |
This does not seem to produce correct output:

make -j && ./bin/llama-cli -m ../models/llama-3.2-3b-instruct/ggml-model-q4_0.gguf -p "I believe the meaning of life is" -n 32 -s 1
I believe the meaning of life is anekloatbounceракahnurnجوistle Scene ERA Ordinary spherecord Cheat Spherebenchbenchogle leaguesipogl ordin usefulnessecutenchhec batchjaxrigerrig
I had only tested with
I'll dig later :)
@ggerganov I believe the model you're using
This is what I get with the
// think of an example
A = 0b1111nnnn;
A[high] == Q4 = 0b1111
// because it uses signed int4
A[high] == -1
// now if you do:
A >> 4   => 0b00001111
// results in a value of +15
A & 0xF0 => 0b11110000
// results in a value of -16 = -1 * 2^4, so keep it like that if you don't want to deal with negative Q4 values

Note: hope I didn't make too many errors with my "math" 😎
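To spell the same point out in scalar C, here is a small standalone sketch (an illustration only, not code from the PR, and the byte value 0xF3 is just a made-up example): the mask keeps the signed high nibble in place, scaled by 16; a plain logical shift loses its sign; and the common factor of 16 is what vcvtq_n_f32_s32(ret, 4) divides away at the end.

#include <stdint.h>
#include <stdio.h>

int main(void) {
    // Hypothetical packed byte: high nibble 0b1111 (-1 as signed int4),
    // low nibble 0b0011 (+3 as signed int4).
    int8_t a = (int8_t)0xF3;

    // Logical shift: the sign of the high nibble is lost.
    int hi_shift = (uint8_t)a >> 4;              // 15, not -1

    // Mask: the high nibble stays in place, i.e. -1 * 16 = -16.
    int8_t hi_mask = (int8_t)(a & 0xF0);         // -16

    // Shifting left extracts the low nibble, also scaled by 16.
    int8_t lo_shift = (int8_t)((uint8_t)a << 4); // 48 = 3 * 16

    // The common factor of 16 is removed once at the end,
    // as vcvtq_n_f32_s32(ret, 4) does in the NEON code.
    printf("%d %d %d -> %d %d\n",
           hi_shift, hi_mask, lo_shift, hi_mask / 16, lo_shift / 16);
    return 0;
}

This should print 15 -16 48 -> -1 3, matching the values in the comment above.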
Force-pushed from 7587b42 to 6b6b98f, then from 6b6b98f to aaa6682.
I ran several tests today with many models (like

Just for your information, I also wrote the code to match the assembly version and not the C version, which slightly differs:

C version:

NEON assembly:
Nice. Btw, to clarify, my #10567 (comment) was about the change that you suggested in #10567 (comment). Without this change (i.e. using the PR as it is), everything works on my end. If I apply the change, the output becomes incorrect.
I tested this on M2 Max and Snapdragon X-Elite.
This PR improves code readability and lays the groundwork for potential optimizations in the future.
For now, I have limited the changes to a single function to ensure this approach is OK for everyone.
I did not observe any significant performance differences using llama-bench.