
ggml-cpu: replace AArch64 NEON assembly with intrinsics in ggml_gemv_q4_0_4x4_q8_0() #10567

Conversation

@angt angt (Contributor) commented Nov 28, 2024

This PR improves code readability and lays the groundwork for potential optimizations in the future.
For now, I have limited the changes to a single function to ensure this approach is OK for everyone.
I did not observe any significant performance differences using llama-bench.
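
For illustration only, here is a minimal, hypothetical sketch (not code from this PR) of the kind of rewrite involved: the same signed 8-bit dot-product accumulation written once with inline assembly and once with the corresponding NEON intrinsic. The function names are made up for the example, and it assumes a compiler targeting AArch64 with the dotprod extension (e.g. -march=armv8.2-a+dotprod):

```c
#include <arm_neon.h>

// Inline-assembly form: one sdot instruction, accumulator updated in place.
static inline int32x4_t dot_asm(int32x4_t acc, int8x16_t a, int8x16_t b) {
    __asm__("sdot %0.4s, %1.16b, %2.16b" : "+w"(acc) : "w"(a), "w"(b));
    return acc;
}

// Intrinsic form: the same operation, but visible to the compiler's optimizer.
static inline int32x4_t dot_intrinsic(int32x4_t acc, int8x16_t a, int8x16_t b) {
    return vdotq_s32(acc, a, b);
}
```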

@github-actions github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) Nov 28, 2024
@angt angt force-pushed the ggml-cpu-replace-aarch64-neon-assembly-with-intrinsics branch from 69dd941 to bde1b96 Compare November 28, 2024 14:53
@ggerganov ggerganov (Member) left a comment

Using intrinsics is definitely preferred.

I also don't observe a significant performance change. Here are the results on M2 Ultra, using the following patch to force the Q4_0_4_4 repack:

diff --git a/ggml/src/ggml-cpu/ggml-cpu-aarch64.c b/ggml/src/ggml-cpu/ggml-cpu-aarch64.c
index 6d2c0adc3..cac45278b 100644
--- a/ggml/src/ggml-cpu/ggml-cpu-aarch64.c
+++ b/ggml/src/ggml-cpu/ggml-cpu-aarch64.c
@@ -3812,7 +3812,7 @@ enum ggml_type ggml_aarch64_get_optimal_repack_type(const struct ggml_tensor * c
             return GGML_TYPE_Q4_0_8_8;
         }
         if (ggml_cpu_has_neon() && ggml_cpu_has_matmul_int8()) {
-            return GGML_TYPE_Q4_0_4_8;
+            //return GGML_TYPE_Q4_0_4_8;
         }
         if (ggml_cpu_has_neon() && ggml_cpu_has_dotprod()) {
             return GGML_TYPE_Q4_0_4_4;
make -j && ./bin/llama-bench -m ../models/qwen2.5-1.5b-coder/ggml-model-q4_0.gguf -t 1,2,4,8,16 -p 0 -n 64 -fa 1

master

| model | size | params | backend | threads | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen2 1.5B Q4_0 | 885.97 MiB | 1.54 B | CPU | 1 | 1 | tg64 | 37.44 ± 0.19 |
| qwen2 1.5B Q4_0 | 885.97 MiB | 1.54 B | CPU | 2 | 1 | tg64 | 65.51 ± 0.19 |
| qwen2 1.5B Q4_0 | 885.97 MiB | 1.54 B | CPU | 4 | 1 | tg64 | 117.41 ± 0.35 |
| qwen2 1.5B Q4_0 | 885.97 MiB | 1.54 B | CPU | 8 | 1 | tg64 | 158.53 ± 0.30 |
| qwen2 1.5B Q4_0 | 885.97 MiB | 1.54 B | CPU | 16 | 1 | tg64 | 159.32 ± 1.79 |

build: 7281cf1 (4211)

PR

| model | size | params | backend | threads | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen2 1.5B Q4_0 | 885.97 MiB | 1.54 B | CPU | 1 | 1 | tg64 | 36.92 ± 0.12 |
| qwen2 1.5B Q4_0 | 885.97 MiB | 1.54 B | CPU | 2 | 1 | tg64 | 66.53 ± 0.55 |
| qwen2 1.5B Q4_0 | 885.97 MiB | 1.54 B | CPU | 4 | 1 | tg64 | 114.92 ± 1.64 |
| qwen2 1.5B Q4_0 | 885.97 MiB | 1.54 B | CPU | 8 | 1 | tg64 | 158.25 ± 0.17 |
| qwen2 1.5B Q4_0 | 885.97 MiB | 1.54 B | CPU | 16 | 1 | tg64 | 157.57 ± 4.43 |

build: bde1b96 (4210)

@angt angt force-pushed the ggml-cpu-replace-aarch64-neon-assembly-with-intrinsics branch from bde1b96 to 7091a85 Compare November 28, 2024 15:23
@angt angt (Contributor, Author) commented Nov 28, 2024

Also, I replicated the current C/asm code as-is, but I would have done something more like this:

-                ret = vdotq_laneq_s32(ret, b0 << 4, a0, 0);
-                ret = vdotq_laneq_s32(ret, b1 << 4, a0, 1);
-                ret = vdotq_laneq_s32(ret, b2 << 4, a0, 2);
-                ret = vdotq_laneq_s32(ret, b3 << 4, a0, 3);
+                ret = vdotq_laneq_s32(ret, b0 >> 4, a0, 0);
+                ret = vdotq_laneq_s32(ret, b1 >> 4, a0, 1);
+                ret = vdotq_laneq_s32(ret, b2 >> 4, a0, 2);
+                ret = vdotq_laneq_s32(ret, b3 >> 4, a0, 3);
 
-                ret = vdotq_laneq_s32(ret, b0 & 0xf0U, a1, 0);
-                ret = vdotq_laneq_s32(ret, b1 & 0xf0U, a1, 1);
-                ret = vdotq_laneq_s32(ret, b2 & 0xf0U, a1, 2);
-                ret = vdotq_laneq_s32(ret, b3 & 0xf0U, a1, 3);
+                ret = vdotq_laneq_s32(ret, b0 & 0xfU, a1, 0);
+                ret = vdotq_laneq_s32(ret, b1 & 0xfU, a1, 1);
+                ret = vdotq_laneq_s32(ret, b2 & 0xfU, a1, 2);
+                ret = vdotq_laneq_s32(ret, b3 & 0xfU, a1, 3);
 
-                acc = vfmaq_f32(acc, vcvtq_n_f32_s32(ret, 4),
+                acc = vfmaq_f32(acc, vcvtq_f32_s32(ret),

If anyone has an explanation for why it was done this way, I'm interested.

@ggerganov ggerganov (Member)

This does not seem to produce correct output:

make -j && ./bin/llama-cli -m ../models/llama-3.2-3b-instruct/ggml-model-q4_0.gguf -p "I believe the meaning of life is" -n 32 -s 1

I believe the meaning of life is anekloatbounceракahnurnجوistle Scene ERA Ordinary spherecord Cheat Spherebenchbenchogle leaguesipogl ordin usefulnessecutenchhec batchjaxrigerrig

@angt angt (Contributor, Author) commented Nov 28, 2024

I had only tested with Llama-3.2-1B-Instruct-Q4_0_4_4.gguf, and the output was the same:

I believe the meaning of life is the concept of finding one's purpose and fulfilling a person's potential.
The meaning of life is indeed a profound question that has puzzled philosophers, theologians, and

I'll dig later :)

@angt angt (Contributor, Author) commented Nov 28, 2024

@ggerganov I believe the model you're using, llama-3.2-3b-instruct/ggml-model-q4_0.gguf, doesn't use ggml_gemv_q4_0_4x4_q8_0, the function modified by this PR, but ggml_gemv_q4_0_4x8_q8_0.

This is what I get with the master branch:

./llama-cli -m /home/ubuntu/models/Llama-3.2-3B-Instruct-Q4_0.gguf -p I believe the meaning of life is -n 32 -s 1
I believe the meaning of life isเJOftionfortujiartuftuftuftipfmeapsujiathuIquiKpstuqf

@Djip007 Djip007 (Contributor) commented Nov 28, 2024

> If anyone has an explanation for why it was done this way, I'm interested.

// think of an example
A = 0b1111nnnn;
A[high] == Q4 = 0b1111
// because it uses signed int4
A[high] == -1
// now if you do:
A >> 4   => 0b00001111
// results in a value of +15
A & 0xF0 => 0b11110000
// results in a value of -16 = -1 * 2^4

So keep it like that if you don't want to deal with sign-extending negative Q4 values.

Note: hope I didn't make too many errors in my "math" 😎
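
To make the argument concrete, here is a minimal scalar C sketch (hypothetical values, not code from the kernel). Keeping each nibble in the high half of a signed 8-bit value preserves its sign; the result is just scaled by 16, which the existing kernel undoes at the end via vcvtq_n_f32_s32(ret, 4):

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    // One packed byte holding two signed 4-bit quants:
    // low nibble = 0b0010 (+2), high nibble = 0b1111 (-1 as signed int4).
    uint8_t a = 0xF2;

    // Extracting the high nibble with a plain right shift zero-fills,
    // turning -1 into +15.
    printf("a >> 4   = %d  (sign lost)\n", a >> 4);            // 15

    // Keeping each nibble in the high half of a signed 8-bit value
    // preserves the sign bit; the value is simply scaled by 16.
    int8_t hi = (int8_t)(a & 0xF0);         // -16 = -1 * 16
    int8_t lo = (int8_t)((a << 4) & 0xFF);  //  32 = +2 * 16
    printf("a & 0xF0 = %d\n", hi);
    printf("a << 4   = %d\n", lo);

    // The kernel accumulates dot products over these *16-scaled values and
    // divides by 16 once, during the int->float conversion.
    return 0;
}
```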

@angt angt force-pushed the ggml-cpu-replace-aarch64-neon-assembly-with-intrinsics branch 2 times, most recently from 7587b42 to 6b6b98f Compare November 29, 2024 10:36
…q4_0_4x4_q8_0()

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
@angt angt force-pushed the ggml-cpu-replace-aarch64-neon-assembly-with-intrinsics branch from 6b6b98f to aaa6682 Compare November 29, 2024 14:59
@angt angt (Contributor, Author) commented Nov 29, 2024

I ran several tests today with many models (like Qwen2.5-3B-Instruct-Q4_0_4_4.gguf and Llama-3.2-1B-Instruct-Q4_0_4_4.gguf) on both the master branch and this one, and I confirm that the output is identical each time.

Just for your information, I also wrote the code to match the assembly version and not the C version, which differs slightly:

C version:

./llama-cli -m /home/ubuntu/models/Llama-3.2-1B-Instruct-Q4_0_4_4.gguf -p "I believe the meaning of life is" -n 48 -s 1
I believe the meaning of life is the concept of finding one's purpose and pursuing happiness through personal growth and self-improvement. It's the idea that, by overcoming and overcoming one's limitations, one can discover a sense of fulfillment and satisfaction that comes from achieving their goals

NEON assembly:

./llama-cli -m /home/ubuntu/models/Llama-3.2-1B-Instruct-Q4_0_4_4.gguf -p "I believe the meaning of life is" -n 48 -s 1
I believe the meaning of life is the concept of finding one's purpose and fulfilling a person's potential.
The meaning of life is indeed a profound question that has puzzled philosophers, theologians, and thinkers for centuries. While there may not be a single, definitive answer, I

@ggerganov ggerganov (Member)

Nice. Btw, to clarify: my #10567 (comment) was about the change that you suggested in #10567 (comment). Without that change (i.e. using the PR as it is), everything works on my end. If I apply the change, the output becomes incorrect.

@max-krasnyansky max-krasnyansky (Collaborator)

I tested this on M2 Max and Snapdragon X-Elite.
About the same performance numbers and the outputs match (given the same seed).

@max-krasnyansky max-krasnyansky merged commit 0c39f44 into ggml-org:master Nov 30, 2024
50 checks passed
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Dec 20, 2024
…q4_0_4x4_q8_0() (ggml-org#10567)

Signed-off-by: Adrien Gallouët <angt@huggingface.co>