ggml : make GeLU faster and more accurate on CPU #8878

Open
jart wants to merge 1 commit into master

Conversation

jart (Contributor) commented Aug 5, 2024

This change makes GeLU go 8x faster on Intel, 3x faster on Apple Silicon, and 2x faster on Threadripper. It's the world's most popular activation function, crucial to models such as Whisper and Gemma. On those models, this change can yield a noticeable improvement in performance, since GeLU is usually the most time-consuming op apart from matrix multiplication.

In addition to improving performance, this change also improves accuracy. On ARM64 and AMD64 systems we no longer need to rely on a 16-bit lookup table; we use SIMD instead. The GeLU lookup table is still here, except it has been converted from fp16 to bf16. This may help align inference more closely with training, and it avoids the two extra lookups into the fp16 table, so the change should have a positive impact on performance for platforms like OpenPOWER and RISC-V too.

Due to the sensitive nature of activation functions, I encourage you all to evaluate its impact on model output before merging. Vectorizing GeLU required trading away a few ulp of worst-case accuracy compared to libm. LLMs normally have a near-limitless tolerance for errors, but due to the nature of tanhf() this is a case where even off-by-one errors can cause user-visible changes in model output. It is my belief, based on my own experiments so far, that this code works well for llama.cpp, whisper.cpp, gemma, etc.
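For readers who want a concrete reference point, here is a minimal scalar sketch of the tanh-form GeLU that ggml's CPU path computes, to my understanding; the constants are the standard GELU tanh-approximation coefficients. The vectorized kernels in this change evaluate the same formula lane-wise, replacing the tanhf() call with a SIMD approximation. This is a sketch for illustration, not the code added by this PR:

```c
// Scalar reference for the tanh-form GeLU (standard approximation constants).
// Illustrative only; not the vectorized code added by this PR.
#include <math.h>
#include <stdio.h>

#define GELU_COEF_A    0.044715f
#define SQRT_2_OVER_PI 0.79788456080286535588f  // sqrt(2/pi)

static float gelu_ref_f32(float x) {
    return 0.5f * x * (1.0f + tanhf(SQRT_2_OVER_PI * x * (1.0f + GELU_COEF_A * x * x)));
}

int main(void) {
    for (float x = -3.0f; x <= 3.0f; x += 1.0f) {
        printf("gelu(% .1f) = % .6f\n", x, gelu_ref_f32(x));
    }
    return 0;
}
```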

This software was developed by Mozilla Ocho and ARM Limited. It first appeared in llamafile which offers you llama.cpp / whisper.cpp / stable-diffusion.cpp with the most bleeding edge performance optimizations and binary distributability.

Benchmarks

Intel(R) Core(TM) i9-14900K (AVX2)

Performance is improved by 8x.
Accuracy is improved from 16 to 29 bits.
Yes, GeLU performance is best on the cheapest of these microprocessors.

master jart@meatball:~/llama.cpp$ make -j tests && ./tests/test-backend-ops -o GELU -b CPU perf

# BEFORE
Backend 1/1 (CPU)
  Backend name: CPU
  GELU(type=f32,ne_a=[128,10,10,10],v=0):   8192 runs -    93.04 us/run -     1000 kB/run -   10.25 GB/s
  GELU(type=f32,ne_a=[7,13,19,23],v=0):     8192 runs -    20.13 us/run -      310 kB/run -   14.72 GB/s
  GELU(type=f32,ne_a=[128,10,10,10],v=1):   4197 runs -    93.81 us/run -     1999 kB/run -   20.32 GB/s
  GELU(type=f32,ne_a=[7,13,19,23],v=1):     8191 runs -    20.17 us/run -      621 kB/run -   29.37 GB/s
  Backend CPU: OK

# AFTER
Backend 1/1 (CPU)
  Backend name: CPU
  GELU(type=f32,ne_a=[128,10,10,10],v=0):   8192 runs -    11.82 us/run -     1000 kB/run -   80.66 GB/s
  GELU(type=f32,ne_a=[7,13,19,23],v=0):     8192 runs -     4.28 us/run -      310 kB/run -   69.28 GB/s
  GELU(type=f32,ne_a=[128,10,10,10],v=1):   4197 runs -    11.65 us/run -     1999 kB/run -  163.68 GB/s
  GELU(type=f32,ne_a=[7,13,19,23],v=1):     8191 runs -     4.24 us/run -      621 kB/run -  139.82 GB/s
  Backend CPU: OK

Mac Studio M2 Ultra (ARM64)

Performance is improved by 3.1x.
Accuracy is improved from 16 to 29 bits.

master jart@studio:~/llama.cpp$ make -j32 tests LLAMA_NO_ACCELERATE=1 LLAMA_NO_METAL=1 && ./tests/test-backend-ops -o GELU -b CPU perf

# BEFORE
Backend 1/1 (CPU)
  Backend name: CPU
  GELU(type=f32,ne_a=[128,10,10,10],v=0):   8192 runs -   125.95 us/run -     1000 kB/run -    7.57 GB/s
  GELU(type=f32,ne_a=[7,13,19,23],v=0):     8192 runs -    14.57 us/run -      310 kB/run -   20.34 GB/s
  GELU(type=f32,ne_a=[128,10,10,10],v=1):   4197 runs -   120.00 us/run -     1999 kB/run -   15.89 GB/s
  GELU(type=f32,ne_a=[7,13,19,23],v=1):     8191 runs -    30.37 us/run -      621 kB/run -   19.51 GB/s
  Backend CPU: OK

# AFTER
Backend 1/1 (CPU)
  Backend name: CPU
  GELU(type=f32,ne_a=[128,10,10,10],v=0):   8192 runs -    40.93 us/run -     1000 kB/run -   23.30 GB/s
  GELU(type=f32,ne_a=[7,13,19,23],v=0):     8192 runs -     9.81 us/run -      310 kB/run -   30.22 GB/s
  GELU(type=f32,ne_a=[128,10,10,10],v=1):   4197 runs -    38.11 us/run -     1999 kB/run -   50.03 GB/s
  GELU(type=f32,ne_a=[7,13,19,23],v=1):     8191 runs -    10.55 us/run -      621 kB/run -   56.18 GB/s
  Backend CPU: OK

AMD Ryzen Threadripper PRO 7995WX (AVX512)

Performance is improved by 2.4x.
Accuracy is improved from 16 to 29 bits.

master jart@luna:~/llama.cpp$ make -j tests && ./tests/test-backend-ops -o GELU -b CPU perf

# BEFORE
Backend 1/1 (CPU)
  Backend name: CPU
  GELU(type=f32,ne_a=[128,10,10,10],v=0):   8192 runs -    27.69 us/run -     1000 kB/run -   34.44 GB/s
  GELU(type=f32,ne_a=[7,13,19,23],v=0):     8192 runs -    10.24 us/run -      310 kB/run -   28.93 GB/s
  GELU(type=f32,ne_a=[128,10,10,10],v=1):   4197 runs -    27.86 us/run -     1999 kB/run -   68.43 GB/s
  GELU(type=f32,ne_a=[7,13,19,23],v=1):     8191 runs -    10.20 us/run -      621 kB/run -   58.11 GB/s
  Backend CPU: OK

# AFTER
Backend 1/1 (CPU)
  Backend name: CPU
  GELU(type=f32,ne_a=[128,10,10,10],v=0):   8192 runs -    11.49 us/run -     1000 kB/run -   82.98 GB/s
  GELU(type=f32,ne_a=[7,13,19,23],v=0):     8192 runs -     7.90 us/run -      310 kB/run -   37.53 GB/s
  GELU(type=f32,ne_a=[128,10,10,10],v=1):   4197 runs -    11.92 us/run -     1999 kB/run -  159.99 GB/s
  GELU(type=f32,ne_a=[7,13,19,23],v=1):     8191 runs -     7.89 us/run -      621 kB/run -   75.07 GB/s
  Backend CPU: OK

github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Aug 5, 2024
jart (Contributor, Author) commented Aug 5, 2024

Test failure appears to be unrelated. Something having to do with JSON. ./tests/test-backend-ops -b CPU perf works fine locally on my Mac Studio.

JohannesGaessler (Collaborator) commented Aug 5, 2024

What's the end-to-end speedup?

> Test failure appears to be unrelated. Something having to do with JSON. ./tests/test-backend-ops -b CPU perf works fine locally on my Mac Studio.

According to the log, the test failure is caused by tests/ggml-backend-ops. More specifically, a single Metal NMSE value is above the threshold for GELU: GELU(type=f32,ne_a=[7,13,19,23],v=0): [GELU] NMSE = 0.000001290 > 0.000000100 FAIL. Notably, however, the tests use unseeded random inputs, so they may fail on some runs but not on others. And a threshold of $10^{-7}$ is, I think, very strict.
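For context, my reading of the NMSE metric referenced in that log line is squared error normalized by the squared magnitude of the reference output. A minimal sketch of that interpretation (not ggml's actual test code):

```c
// Minimal sketch of a normalized mean squared error check in the spirit of the
// failing threshold above: sum((out - ref)^2) / sum(ref^2). Not ggml's exact code.
#include <stdio.h>

static double nmse(const float *ref, const float *out, int n) {
    double err = 0.0, ref2 = 0.0;
    for (int i = 0; i < n; i++) {
        double d = (double) out[i] - (double) ref[i];
        err  += d * d;
        ref2 += (double) ref[i] * (double) ref[i];
    }
    return ref2 > 0.0 ? err / ref2 : 0.0;
}

int main(void) {
    float ref[] = { 1.0f, -2.0f, 3.0f };
    float out[] = { 1.0f, -2.0f, 3.001f };
    printf("NMSE = %.9f (fails a 1e-7 threshold if larger)\n", nmse(ref, out, 3));
    return 0;
}
```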

lrvl commented Aug 5, 2024

Does llamafile 0.8.12 use this already?

jart (Contributor, Author) commented Aug 5, 2024

Regarding that failure, I've rolled back the bf16 changes to the LUT just to play it safe.

@lrvl asks: Does llamafile 0.8.12 use this already?

No, you have to build it at HEAD. A release will be coming out shortly. See below for build instructions.

@JohannesGaessler asks: What's the end-to-end speedup?

For tiny models it can be as high as 17% faster overall. Take, for example, whisper.cpp using ggml-tiny-q5_1.bin. I can build that with llamafile for the purpose of turning an Edgar Allan Poe wav file into txt as follows:

sudo apt install sox libsox-fmt-all
git clone https://github.com/mozilla-ocho/llamafile/
cd llamafile
wget -O whisper-tiny.en-q5_1.bin https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-tiny.en-q5_1.bin
wget https://archive.org/download/raven/raven_poe_64kb.mp3
sox raven_poe_64kb.mp3 -r 16k raven_poe_64kb.wav
make -j o//whisper.cpp
o//whisper.cpp/main -m whisper-tiny.en-q5_1.bin -f raven_poe_64kb.wav

And it takes 7.9 seconds. But if I pass the --trap flag to enable trapping math, then in llamafile these vectorized libm functions will be disabled.

o//whisper.cpp/main -m whisper-tiny.en-q5_1.bin -f raven_poe_64kb.wav --trap

And it takes 9.5 seconds. That's for full-quality libm tanhf() GeLU. If I use the 16-bit lookup-table approximation of the GeLU approximation that llama.cpp currently uses, then it does go a little faster: 9.2 seconds. The 128 KB LUT actually goes a lot slower in practice than benchmarks would lead you to believe, I think possibly because of how it demolishes the L1 cache. The vectorized code helps quite a bit with performance on my machines, and it gives you near-full accuracy for free.
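To make the cache argument concrete, here is a sketch of how a 16-bit lookup-table activation generally works: the input is rounded to fp16 and its 16-bit pattern indexes a 65,536-entry table. ggml's real table stores fp16 outputs (roughly 128 KB); the names, layout, and storage type below are illustrative, and the code assumes a compiler with _Float16 support:

```c
// Illustrative 16-bit lookup-table GeLU: round the input to fp16, then use the
// bit pattern as a table index. Assumes _Float16 support (recent GCC/Clang).
// Table name, layout, and storage type are illustrative, not ggml's actual code.
#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static float gelu_table[1 << 16];  // stored as f32 here; a 2-byte fp16 table would be ~128 KB

static float gelu_ref(float x) {
    return 0.5f * x * (1.0f + tanhf(0.79788456f * x * (1.0f + 0.044715f * x * x)));
}

static void gelu_table_init(void) {
    for (uint32_t i = 0; i < (1u << 16); i++) {
        uint16_t bits = (uint16_t) i;
        _Float16 h;
        memcpy(&h, &bits, sizeof(h));       // reinterpret the 16-bit pattern as an fp16 value
        gelu_table[i] = gelu_ref((float) h);
    }
}

static float gelu_lut(float x) {
    _Float16 h = (_Float16) x;              // rounding to fp16 is the source of the 16-bit accuracy limit
    uint16_t bits;
    memcpy(&bits, &h, sizeof(bits));
    return gelu_table[bits];
}

int main(void) {
    gelu_table_init();
    printf("lut: %f  exact tanh form: %f\n", gelu_lut(1.5f), gelu_ref(1.5f));
    return 0;
}
```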

ggerganov (Member) commented:
Here are some end-to-end CPU results on my hardware with llama.cpp and whisper.cpp:

Gemma-2B

GGML_NO_METAL=1 ./scripts/compare-commits.sh master pr/8878 \
    -m models/gemma-2b/ggml-model-q4_0.gguf -r 10 -p 0 -t 4,8,12,16
| CPU | Model | Threads | Test | t/s master | t/s pr/8878 | Speedup |
| --- | --- | --- | --- | --- | --- | --- |
| M2 Ultra | gemma 2B Q4_0 | 4 | tg128 | 47.89 | 47.81 | 1.00 |
| M2 Ultra | gemma 2B Q4_0 | 8 | tg128 | 71.46 | 71.09 | 0.99 |
| M2 Ultra | gemma 2B Q4_0 | 12 | tg128 | 83.52 | 82.41 | 0.99 |
| M2 Ultra | gemma 2B Q4_0 | 16 | tg128 | 90.55 | 90.10 | 1.00 |
./scripts/compare-commits.sh master pr/8878 \
    -m models/gemma-2b/ggml-model-q4_0.gguf -r 10 -p 0 -t 4,8,12,16
| CPU | Model | Threads | Test | t/s master | t/s pr/8878 | Speedup |
| --- | --- | --- | --- | --- | --- | --- |
| AMD Ryzen 9 5950X 16-Core | gemma 2B Q4_0 | 4 | tg128 | 17.27 | 17.35 | 1.00 |
| AMD Ryzen 9 5950X 16-Core | gemma 2B Q4_0 | 8 | tg128 | 18.28 | 18.50 | 1.01 |
| AMD Ryzen 9 5950X 16-Core | gemma 2B Q4_0 | 12 | tg128 | 18.38 | 18.54 | 1.01 |
| AMD Ryzen 9 5950X 16-Core | gemma 2B Q4_0 | 16 | tg128 | 18.33 | 18.40 | 1.00 |

Whisper

GGML_NO_METAL=1 make -j && ./scripts/bench-all.sh 8
  • master
| CPU | Config | Model | Th | FA | Enc. | Dec. | Bch5 | PP | Commit |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| M2 Ultra | NEON BLAS | tiny | 8 | 0 | 80.80 | 1.06 | 0.35 | 0.11 | fe36c90 |
| M2 Ultra | NEON BLAS | tiny-q5_0 | 8 | 0 | 80.81 | 0.76 | 0.34 | 0.11 | fe36c90 |
| M2 Ultra | NEON BLAS | tiny-q5_1 | 8 | 0 | 83.11 | 0.80 | 0.38 | 0.11 | fe36c90 |
| M2 Ultra | NEON BLAS | base | 8 | 0 | 159.92 | 1.83 | 0.65 | 0.21 | fe36c90 |
| M2 Ultra | NEON BLAS | base-q5_0 | 8 | 0 | 160.57 | 1.32 | 0.63 | 0.21 | fe36c90 |
| M2 Ultra | NEON BLAS | base-q5_1 | 8 | 0 | 150.90 | 1.37 | 0.67 | 0.21 | fe36c90 |
| M2 Ultra | NEON BLAS | small | 8 | 0 | 466.81 | 4.69 | 1.73 | 0.56 | fe36c90 |
| M2 Ultra | NEON BLAS | small-q5_0 | 8 | 0 | 462.21 | 3.29 | 1.63 | 0.55 | fe36c90 |
| M2 Ultra | NEON BLAS | small-q5_1 | 8 | 0 | 469.31 | 3.41 | 1.77 | 0.56 | fe36c90 |
  • cherry-pick this PR in whisper.cpp
| CPU | Config | Model | Th | FA | Enc. | Dec. | Bch5 | PP | Commit |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| M2 Ultra | NEON BLAS | tiny | 8 | 0 | 80.47 | 1.07 | 0.35 | 0.11 | fe36c90 |
| M2 Ultra | NEON BLAS | tiny-q5_0 | 8 | 0 | 79.52 | 0.77 | 0.35 | 0.11 | fe36c90 |
| M2 Ultra | NEON BLAS | tiny-q5_1 | 8 | 0 | 76.88 | 0.76 | 0.36 | 0.11 | fe36c90 |
| M2 Ultra | NEON BLAS | base | 8 | 0 | 157.43 | 1.81 | 0.64 | 0.21 | fe36c90 |
| M2 Ultra | NEON BLAS | base-q5_0 | 8 | 0 | 157.88 | 1.29 | 0.62 | 0.21 | fe36c90 |
| M2 Ultra | NEON BLAS | base-q5_1 | 8 | 0 | 154.50 | 1.35 | 0.67 | 0.21 | fe36c90 |
| M2 Ultra | NEON BLAS | small | 8 | 0 | 461.85 | 4.63 | 1.72 | 0.57 | fe36c90 |
| M2 Ultra | NEON BLAS | small-q5_0 | 8 | 0 | 450.27 | 3.25 | 1.62 | 0.56 | fe36c90 |
| M2 Ultra | NEON BLAS | small-q5_1 | 8 | 0 | 454.20 | 3.46 | 1.76 | 0.56 | fe36c90 |

(These are times in ms; lower is better. I would have expected to see a slight gain in Dec. and Bch5.)

Overall, I don't see much difference in the end-to-end performance. I do observe the speed improvement of the individual GELU op as measured by test-backend-ops perf, but it seems to be negligible relative to the overall model evaluation. @jart can you run the Gemma-2B comparison on your hardware and report the numbers?

> For tiny models it can be as high as 17% faster overall.

For measuring the performance of whisper.cpp I recommend using the bench tool, or the scripts/bench-all.sh script as I did above. Measuring the total transcription time of auto-regressive models is meaningless. Apart from whisper.cpp, I have never seen anyone else report performance results for Whisper correctly.

Comment on lines +2388 to +2385
if (!vpaddd_u64(vreinterpretq_u64_u32(special)))
return result;
return (float32x4_t){ special[0] ? tanhf(x[0]) : result[0],
special[1] ? tanhf(x[1]) : result[1],
special[2] ? tanhf(x[2]) : result[2],
special[3] ? tanhf(x[3]) : result[3] };
A Member commented:

Should we handle the special case as early as possible?

The AVX implementations don't handle the special case - would this be a problem?

jart (Contributor, Author) replied:

The if() branch is more likely to be taken, so putting it first helps the compiler optimize.

The x86 code doesn't need the fallback. I tested every float to be sure. Here's the accuracy of the arm approximation:

           110850649x            1  ulp errors
            17787389x            2  ulp errors
               67537x            3  ulp errors
                 124x            4  ulp errors

Here's the accuracy of avx512, which doesn't have any fallback to libc tanhf().

           108638843x            1  ulp errors
            18129735x            2  ulp errors
              143921x            3  ulp errors
                 124x            4  ulp errors

So it's only a teensy tiny bit worse. That's well within my tolerances and it gets rid of a slowdown.

The arm code however can't survive without the libc fallback. If we remove it, we hit ugly cases 0.5% of the time.

   110850649x            1  ulp errors
    17787389x            2  ulp errors
       67537x            3  ulp errors
         124x            4  ulp errors
    16777215x          31+  ulp errors

Someone here who has time might want to take a closer look into why my x86 version of arm's algorithm is doing much better. I'm still relatively new to the arm architecture.
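
For anyone who wants to reproduce these kinds of counts, here is a minimal sketch of an exhaustive ulp scan over every finite float. tanhf_approx() is a placeholder for a scalar re-implementation of one SIMD lane; this is not the actual llamafile test harness:

```c
// Exhaustive ulp-error scan against libm tanhf(). Takes a few minutes to run.
// tanhf_approx() is a placeholder; substitute the approximation under test.
#include <inttypes.h>
#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static float tanhf_approx(float x) {
    return tanhf(x);  // placeholder
}

// Map a float to an ordered integer so that differences count ulps.
static int64_t float_to_ordered(float f) {
    int32_t i;
    memcpy(&i, &f, sizeof(i));
    return i < 0 ? (int64_t) INT32_MIN - i : i;
}

int main(void) {
    uint64_t histogram[64] = {0};  // bucket 63 collects everything >= 63 ulps
    for (uint64_t bits = 0; bits <= UINT32_MAX; bits++) {
        float x;
        uint32_t b = (uint32_t) bits;
        memcpy(&x, &b, sizeof(x));
        if (isnan(x)) continue;
        int64_t d = float_to_ordered(tanhf(x)) - float_to_ordered(tanhf_approx(x));
        uint64_t ulps = (uint64_t) (d < 0 ? -d : d);
        histogram[ulps < 63 ? ulps : 63]++;
    }
    for (int i = 1; i < 64; i++) {
        if (histogram[i]) {
            printf("%12" PRIu64 "x  %s%d ulp errors\n", histogram[i], i == 63 ? ">= " : "", i);
        }
    }
    return 0;
}
```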

mofosyne added the Review Complexity : High label (generally requires in-depth knowledge of LLMs or GPUs) on Aug 6, 2024
This change makes GeLU more accurate on amd64 and arm64 by using a tanhf
approximation that's been explicitly vectorized for avx512f, avx2, sse2,
and neon. No performance is traded away on these architectures, compared
to the 16-bit lookup table that was being used previously. The impact of
this change can be demonstrated easily with whisper, where it leads to a
measurable improvement in Levenshtein distance of model output.
jart (Contributor, Author) commented Aug 18, 2024

If you're measuring the big picture performance impact of this change, you're only guaranteed to get a noticeable speedup if you do an apples-to-apples comparison of this vectorized tanhf() with the libc version. Your FP16 LUT already produces good output and it's very fast. We know from your backend test suite microbenchmarks that the vectorized version is faster though. There are many possible reasons why this performance improvement wouldn't manifest itself in your overall tokens per second. For example, it could be swallowed up by the thread synchronization gaps between ops caused by a suboptimal barrier implementation, which I'm planning to improve for you soon.

The most persuasive selling point for this change is its impact on output quality. I've measured the impact this change has on the whisper model's output when transcribing Edgar Allan Poe recordings. When using your tiny.en.q5_1 weights (31mb) on The Raven (from archive.org), the Levenshtein distance of whisper's output, compared against the Project Gutenberg text, improves from 0.853006 to 0.857595. That's 0.004589 better. Now compare this with the output quality of the medium.en (1.5gb) weights, which produce output with a Levenshtein distance of 0.942247. Therefore, my vectorized GeLU function gives you about 5% of the quality improvement you'd expect if you were to upgrade from the tiny to the medium model. So that 0.004589 delta matters a lot for a 1072-word poem.
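For reference, here is a minimal sketch of the kind of edit-distance scoring described above, assuming the reported figures are a normalized similarity of the form 1 - levenshtein(hyp, ref) / max(|hyp|, |ref|). The exact tokenization (character vs word level) isn't stated, so this character-level version is only illustrative:

```c
// Sketch of a normalized edit-similarity score, assuming the reported figures
// are 1 - levenshtein(hyp, ref) / max(|hyp|, |ref|). Character-level and
// illustrative only; not the actual measurement script.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static size_t levenshtein(const char *a, const char *b) {
    size_t n = strlen(a), m = strlen(b);
    size_t *row = malloc((m + 1) * sizeof *row);
    for (size_t j = 0; j <= m; j++) row[j] = j;
    for (size_t i = 1; i <= n; i++) {
        size_t prev = row[0];              // D[i-1][j-1]
        row[0] = i;
        for (size_t j = 1; j <= m; j++) {
            size_t cur  = row[j];          // D[i-1][j]
            size_t cost = (a[i-1] == b[j-1]) ? 0 : 1;
            size_t best = prev + cost;                     // substitution / match
            if (row[j] + 1 < best)   best = row[j] + 1;    // deletion
            if (row[j-1] + 1 < best) best = row[j-1] + 1;  // insertion
            row[j] = best;
            prev = cur;
        }
    }
    size_t d = row[m];
    free(row);
    return d;
}

static double similarity(const char *hyp, const char *ref) {
    size_t n = strlen(hyp), m = strlen(ref);
    size_t len = n > m ? n : m;
    return len ? 1.0 - (double) levenshtein(hyp, ref) / (double) len : 1.0;
}

int main(void) {
    printf("%f\n", similarity("quoth the ravn", "quoth the raven"));
    return 0;
}
```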

I've force-pushed the latest iteration of my ggml_vec_gelu_f32() function, which now has smarter edge-case handling. The tests I wrote for it are in llamafile's vmathf_test.cpp file. This more consistent edge-case handling has improved my confidence in this implementation to the point where I now recommend deleting the old FP16 LUT code, just like we did before when I optimized SiLU.

PTAL.
