ggml-cpu: replace AArch64 NEON assembly with intrinsics in ggml_gemv_q4_0_4x4_q8_0() #10567
Conversation
Force-pushed from 69dd941 to bde1b96.
Using intrinsics is definitely preferred.
I also don't observe a significant performance change. Here are the results on M2 Ultra, using the following patch to force the Q4_0_4_4 repack:
diff --git a/ggml/src/ggml-cpu/ggml-cpu-aarch64.c b/ggml/src/ggml-cpu/ggml-cpu-aarch64.c
index 6d2c0adc3..cac45278b 100644
--- a/ggml/src/ggml-cpu/ggml-cpu-aarch64.c
+++ b/ggml/src/ggml-cpu/ggml-cpu-aarch64.c
@@ -3812,7 +3812,7 @@ enum ggml_type ggml_aarch64_get_optimal_repack_type(const struct ggml_tensor * c
return GGML_TYPE_Q4_0_8_8;
}
if (ggml_cpu_has_neon() && ggml_cpu_has_matmul_int8()) {
- return GGML_TYPE_Q4_0_4_8;
+ //return GGML_TYPE_Q4_0_4_8;
}
if (ggml_cpu_has_neon() && ggml_cpu_has_dotprod()) {
return GGML_TYPE_Q4_0_4_4;
make -j && ./bin/llama-bench -m ../models/qwen2.5-1.5b-coder/ggml-model-q4_0.gguf -t 1,2,4,8,16 -p 0 -n 64 -fa 1
master
model | size | params | backend | threads | fa | test | t/s |
---|---|---|---|---|---|---|---|
qwen2 1.5B Q4_0 | 885.97 MiB | 1.54 B | CPU | 1 | 1 | tg64 | 37.44 ± 0.19 |
qwen2 1.5B Q4_0 | 885.97 MiB | 1.54 B | CPU | 2 | 1 | tg64 | 65.51 ± 0.19 |
qwen2 1.5B Q4_0 | 885.97 MiB | 1.54 B | CPU | 4 | 1 | tg64 | 117.41 ± 0.35 |
qwen2 1.5B Q4_0 | 885.97 MiB | 1.54 B | CPU | 8 | 1 | tg64 | 158.53 ± 0.30 |
qwen2 1.5B Q4_0 | 885.97 MiB | 1.54 B | CPU | 16 | 1 | tg64 | 159.32 ± 1.79 |
build: 7281cf1 (4211)
PR
model | size | params | backend | threads | fa | test | t/s |
---|---|---|---|---|---|---|---|
qwen2 1.5B Q4_0 | 885.97 MiB | 1.54 B | CPU | 1 | 1 | tg64 | 36.92 ± 0.12 |
qwen2 1.5B Q4_0 | 885.97 MiB | 1.54 B | CPU | 2 | 1 | tg64 | 66.53 ± 0.55 |
qwen2 1.5B Q4_0 | 885.97 MiB | 1.54 B | CPU | 4 | 1 | tg64 | 114.92 ± 1.64 |
qwen2 1.5B Q4_0 | 885.97 MiB | 1.54 B | CPU | 8 | 1 | tg64 | 158.25 ± 0.17 |
qwen2 1.5B Q4_0 | 885.97 MiB | 1.54 B | CPU | 16 | 1 | tg64 | 157.57 ± 4.43 |
build: bde1b96 (4210)
Force-pushed from bde1b96 to 7091a85.
Also, I replicated the current C/asm code, but I would have done something more like this:

- ret = vdotq_laneq_s32(ret, b0 << 4, a0, 0);
- ret = vdotq_laneq_s32(ret, b1 << 4, a0, 1);
- ret = vdotq_laneq_s32(ret, b2 << 4, a0, 2);
- ret = vdotq_laneq_s32(ret, b3 << 4, a0, 3);
+ ret = vdotq_laneq_s32(ret, b0 >> 4, a0, 0);
+ ret = vdotq_laneq_s32(ret, b1 >> 4, a0, 1);
+ ret = vdotq_laneq_s32(ret, b2 >> 4, a0, 2);
+ ret = vdotq_laneq_s32(ret, b3 >> 4, a0, 3);
- ret = vdotq_laneq_s32(ret, b0 & 0xf0U, a1, 0);
- ret = vdotq_laneq_s32(ret, b1 & 0xf0U, a1, 1);
- ret = vdotq_laneq_s32(ret, b2 & 0xf0U, a1, 2);
- ret = vdotq_laneq_s32(ret, b3 & 0xf0U, a1, 3);
+ ret = vdotq_laneq_s32(ret, b0 & 0xfU, a1, 0);
+ ret = vdotq_laneq_s32(ret, b1 & 0xfU, a1, 1);
+ ret = vdotq_laneq_s32(ret, b2 & 0xfU, a1, 2);
+ ret = vdotq_laneq_s32(ret, b3 & 0xfU, a1, 3);
- acc = vfmaq_f32(acc, vcvtq_n_f32_s32(ret, 4),
+ acc = vfmaq_f32(acc, vcvtq_f32_s32(ret),

If anyone has an explanation for why it was done this way, I'm interested.
+ acc = vfmaq_f32(acc, vcvtq_f32_s32(ret), If anyone has an explanation for why it was done this way, I'm interested. |
This does not seem to produce correct output:

make -j && ./bin/llama-cli -m ../models/llama-3.2-3b-instruct/ggml-model-q4_0.gguf -p "I believe the meaning of life is" -n 32 -s 1
I believe the meaning of life is anekloatbounceракahnurnجوistle Scene ERA Ordinary spherecord Cheat Spherebenchbenchogle leaguesipogl ordin usefulnessecutenchhec batchjaxrigerrig
I had only tested with
I'll dig later :)
@ggerganov I believe the model you're using
This is what I get with the
// think of an example
A = 0b1111nnnn;
A[high] == Q4 = 0b1111
// because it uses signed int4
A[high] == -1
// now if you do:
A >> 4   => 0b00001111
// results in a value of +15
A & 0xF0 => 0b11110000
// results in a value of -16 = -1 * 2^4, so keep it like that if you don't want to deal with negative Q4 values

Note: hope I didn't make too many errors with my "math" 😎
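To spell the same point out in scalar C, here is a small standalone sketch (an illustration only, not code from the PR, and the byte value 0xF3 is just a made-up example): the mask keeps the signed high nibble in place, scaled by 16; a plain logical shift loses its sign; and the common factor of 16 is what vcvtq_n_f32_s32(ret, 4) divides away at the end.

#include <stdint.h>
#include <stdio.h>

int main(void) {
    // Hypothetical packed byte: high nibble 0b1111 (-1 as signed int4),
    // low nibble 0b0011 (+3 as signed int4).
    int8_t a = (int8_t)0xF3;

    // Logical shift: the sign of the high nibble is lost.
    int hi_shift = (uint8_t)a >> 4;              // 15, not -1

    // Mask: the high nibble stays in place, i.e. -1 * 16 = -16.
    int8_t hi_mask = (int8_t)(a & 0xF0);         // -16

    // Shifting left extracts the low nibble, also scaled by 16.
    int8_t lo_shift = (int8_t)((uint8_t)a << 4); // 48 = 3 * 16

    // The common factor of 16 is removed once at the end,
    // as vcvtq_n_f32_s32(ret, 4) does in the NEON code.
    printf("%d %d %d -> %d %d\n",
           hi_shift, hi_mask, lo_shift, hi_mask / 16, lo_shift / 16);
    return 0;
}

This should print 15 -16 48 -> -1 3, matching the values in the comment above.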
Force-pushed from 7587b42 to 6b6b98f, then from 6b6b98f to aaa6682.
I ran several tests today with many models (like

Just for your information, I also wrote the code to match the assembly version and not the C version, which slightly differs:

C version:

NEON assembly:
Nice. Btw, to clarify, my #10567 (comment) was about the change that you suggested in #10567 (comment). Without this change (i.e. using the PR as it is), everything works on my end. If I apply the change, the output becomes incorrect.
I tested this on M2 Max and Snapdragon X-Elite.
This PR improves code readability and lays the groundwork for potential optimizations in the future.
For now, I have limited the changes to a single function to ensure this approach is OK for everyone.
I did not observe any significant performance differences using llama-bench.