ggml : make GeLU faster and more accurate on CPU #8878

Open
jart wants to merge 1 commit into master

Conversation

jart (Contributor) commented Aug 5, 2024

This change makes GeLU go 8x faster on Intel, 3x faster on Apple Silicon, and 2x faster on Threadripper. It's the world's most popular activation function, crucial to models such as Whisper and Gemma. On those models, this change can yield a noticeable improvement in performance, since GeLU is usually the most time-consuming op apart from matrix multiplication.

In addition to improving performance, this change also improves accuracy. On ARM64 and AMD64 systems we no longer need to rely on a 16-bit lookup table; we use SIMD instead. The GeLU lookup table is still here, except it has been converted from fp16 to bf16. This may help align inference more closely with training, and it avoids the two extra lookups into the fp16 table, so the change should have a positive impact on performance for platforms like OpenPOWER and RISC-V too.

Due to the sensitive nature of activation functions, I encourage you all to evaluate its impact on model output before merging. Vectorizing GeLU required trading away a few ulp of worst-case accuracy compared to libm. LLMs normally have a near-limitless tolerance for errors, but due to the nature of tanhf() this is a case where even off-by-one errors can cause user-visible changes in model output. It is my belief, based on my own experiments so far, that this code works well for llama.cpp, whisper.cpp, gemma, etc.
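For readers who want a concrete reference point, here is a minimal scalar sketch of the tanh-form GeLU that ggml's CPU path computes, to my understanding; the constants are the standard GELU tanh-approximation coefficients. The vectorized kernels in this change evaluate the same formula lane-wise, replacing the tanhf() call with a SIMD approximation. This is a sketch for illustration, not the code added by this PR:

```c
// Scalar reference for the tanh-form GeLU (standard approximation constants).
// Illustrative only; not the vectorized code added by this PR.
#include <math.h>
#include <stdio.h>

#define GELU_COEF_A    0.044715f
#define SQRT_2_OVER_PI 0.79788456080286535588f  // sqrt(2/pi)

static float gelu_ref_f32(float x) {
    return 0.5f * x * (1.0f + tanhf(SQRT_2_OVER_PI * x * (1.0f + GELU_COEF_A * x * x)));
}

int main(void) {
    for (float x = -3.0f; x <= 3.0f; x += 1.0f) {
        printf("gelu(% .1f) = % .6f\n", x, gelu_ref_f32(x));
    }
    return 0;
}
```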

This software was developed by Mozilla Ocho and ARM Limited. It first appeared in llamafile which offers you llama.cpp / whisper.cpp / stable-diffusion.cpp with the most bleeding edge performance optimizations and binary distributability.

Benchmarks

Intel(R) Core(TM) i9-14900K (AVX2)

Performance is improved by 8x.
Accuracy is improved from 16 to 29 bits.
Yes, GeLU performance is best on the cheapest of these microprocessors.

master jart@meatball:~/llama.cpp$ make -j tests && ./tests/test-backend-ops -o GELU -b CPU perf

# BEFORE
Backend 1/1 (CPU)
  Backend name: CPU
  GELU(type=f32,ne_a=[128,10,10,10],v=0):   8192 runs -    93.04 us/run -     1000 kB/run -   10.25 GB/s
  GELU(type=f32,ne_a=[7,13,19,23],v=0):     8192 runs -    20.13 us/run -      310 kB/run -   14.72 GB/s
  GELU(type=f32,ne_a=[128,10,10,10],v=1):   4197 runs -    93.81 us/run -     1999 kB/run -   20.32 GB/s
  GELU(type=f32,ne_a=[7,13,19,23],v=1):     8191 runs -    20.17 us/run -      621 kB/run -   29.37 GB/s
  Backend CPU: OK

# AFTER
Backend 1/1 (CPU)
  Backend name: CPU
  GELU(type=f32,ne_a=[128,10,10,10],v=0):   8192 runs -    11.82 us/run -     1000 kB/run -   80.66 GB/s
  GELU(type=f32,ne_a=[7,13,19,23],v=0):     8192 runs -     4.28 us/run -      310 kB/run -   69.28 GB/s
  GELU(type=f32,ne_a=[128,10,10,10],v=1):   4197 runs -    11.65 us/run -     1999 kB/run -  163.68 GB/s
  GELU(type=f32,ne_a=[7,13,19,23],v=1):     8191 runs -     4.24 us/run -      621 kB/run -  139.82 GB/s
  Backend CPU: OK

Mac Studio M2 Ultra (ARM64)

Performance is improved by 3.1x.
Accuracy is improved from 16 to 29 bits.

master jart@studio:~/llama.cpp$ make -j32 tests LLAMA_NO_ACCELERATE=1 LLAMA_NO_METAL=1 && ./tests/test-backend-ops -o GELU -b CPU perf

# BEFORE
Backend 1/1 (CPU)
  Backend name: CPU
  GELU(type=f32,ne_a=[128,10,10,10],v=0):   8192 runs -   125.95 us/run -     1000 kB/run -    7.57 GB/s
  GELU(type=f32,ne_a=[7,13,19,23],v=0):     8192 runs -    14.57 us/run -      310 kB/run -   20.34 GB/s
  GELU(type=f32,ne_a=[128,10,10,10],v=1):   4197 runs -   120.00 us/run -     1999 kB/run -   15.89 GB/s
  GELU(type=f32,ne_a=[7,13,19,23],v=1):     8191 runs -    30.37 us/run -      621 kB/run -   19.51 GB/s
  Backend CPU: OK

# AFTER
Backend 1/1 (CPU)
  Backend name: CPU
  GELU(type=f32,ne_a=[128,10,10,10],v=0):   8192 runs -    40.93 us/run -     1000 kB/run -   23.30 GB/s
  GELU(type=f32,ne_a=[7,13,19,23],v=0):     8192 runs -     9.81 us/run -      310 kB/run -   30.22 GB/s
  GELU(type=f32,ne_a=[128,10,10,10],v=1):   4197 runs -    38.11 us/run -     1999 kB/run -   50.03 GB/s
  GELU(type=f32,ne_a=[7,13,19,23],v=1):     8191 runs -    10.55 us/run -      621 kB/run -   56.18 GB/s
  Backend CPU: OK

AMD Ryzen Threadripper PRO 7995WX (AVX512)

Performance is improved by 2.4x.
Accuracy is improved from 16 to 29 bits.

master jart@luna:~/llama.cpp$ make -j tests && ./tests/test-backend-ops -o GELU -b CPU perf

# BEFORE
Backend 1/1 (CPU)
  Backend name: CPU
  GELU(type=f32,ne_a=[128,10,10,10],v=0):   8192 runs -    27.69 us/run -     1000 kB/run -   34.44 GB/s
  GELU(type=f32,ne_a=[7,13,19,23],v=0):     8192 runs -    10.24 us/run -      310 kB/run -   28.93 GB/s
  GELU(type=f32,ne_a=[128,10,10,10],v=1):   4197 runs -    27.86 us/run -     1999 kB/run -   68.43 GB/s
  GELU(type=f32,ne_a=[7,13,19,23],v=1):     8191 runs -    10.20 us/run -      621 kB/run -   58.11 GB/s
  Backend CPU: OK

# AFTER
Backend 1/1 (CPU)
  Backend name: CPU
  GELU(type=f32,ne_a=[128,10,10,10],v=0):   8192 runs -    11.49 us/run -     1000 kB/run -   82.98 GB/s
  GELU(type=f32,ne_a=[7,13,19,23],v=0):     8192 runs -     7.90 us/run -      310 kB/run -   37.53 GB/s
  GELU(type=f32,ne_a=[128,10,10,10],v=1):   4197 runs -    11.92 us/run -     1999 kB/run -  159.99 GB/s
  GELU(type=f32,ne_a=[7,13,19,23],v=1):     8191 runs -     7.89 us/run -      621 kB/run -   75.07 GB/s
  Backend CPU: OK

github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Aug 5, 2024
jart (Contributor, Author) commented Aug 5, 2024

Test failure appears to be unrelated. Something having to do with JSON. ./tests/test-backend-ops -b CPU perf works fine locally on my Mac Studio.

JohannesGaessler (Collaborator) commented Aug 5, 2024

What's the end-to-end speedup?

> Test failure appears to be unrelated. Something having to do with JSON. ./tests/test-backend-ops -b CPU perf works fine locally on my Mac Studio.

According to the log, the test failure is caused by tests/ggml-backend-ops. More specifically, a single Metal NMSE value is above the threshold for GELU: GELU(type=f32,ne_a=[7,13,19,23],v=0): [GELU] NMSE = 0.000001290 > 0.000000100 FAIL. Notably, however, the tests use unseeded random inputs, so they may fail on some runs but not on others. And a threshold of $10^{-7}$ is, I think, very strict.
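For context, my reading of the NMSE metric referenced in that log line is squared error normalized by the squared magnitude of the reference output. A minimal sketch of that interpretation (not ggml's actual test code):

```c
// Minimal sketch of a normalized mean squared error check in the spirit of the
// failing threshold above: sum((out - ref)^2) / sum(ref^2). Not ggml's exact code.
#include <stdio.h>

static double nmse(const float *ref, const float *out, int n) {
    double err = 0.0, ref2 = 0.0;
    for (int i = 0; i < n; i++) {
        double d = (double) out[i] - (double) ref[i];
        err  += d * d;
        ref2 += (double) ref[i] * (double) ref[i];
    }
    return ref2 > 0.0 ? err / ref2 : 0.0;
}

int main(void) {
    float ref[] = { 1.0f, -2.0f, 3.0f };
    float out[] = { 1.0f, -2.0f, 3.001f };
    printf("NMSE = %.9f (fails a 1e-7 threshold if larger)\n", nmse(ref, out, 3));
    return 0;
}
```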

lrvl commented Aug 5, 2024

Does llamafile 0.8.12 use this already?

jart (Contributor, Author) commented Aug 5, 2024

Regarding that failure, I've rolled back the bf16 changes to the LUT just to play it safe.

@lrvl asks: Does llamafile 0.8.12 use this already?

No, you have to build it at HEAD. A release will be coming out shortly. See below for build instructions.

@JohannesGaessler asks: What's the end-to-end speedup?

For tiny models it can be as high as 17% faster overall. Take, for example, whisper.cpp using ggml-tiny-q5_1.bin. I can build that with llamafile for the purpose of turning an Edgar Allan Poe wav file into txt as follows:

sudo apt install sox libsox-fmt-all
git clone https://github.com/mozilla-ocho/llamafile/
cd llamafile
wget -O whisper-tiny.en-q5_1.bin https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-tiny.en-q5_1.bin
wget https://archive.org/download/raven/raven_poe_64kb.mp3
sox raven_poe_64kb.mp3 -r 16k raven_poe_64kb.wav
make -j o//whisper.cpp
o//whisper.cpp/main -m whisper-tiny.en-q5_1.bin -f raven_poe_64kb.wav

And it takes 7.9 seconds. But if I pass the --trap flag to enable trapping math, then in llamafile these vectorized libm functions will be disabled.

o//whisper.cpp/main -m whisper-tiny.en-q5_1.bin -f raven_poe_64kb.wav --trap

And it takes 9.5 seconds. That's for full-quality libm tanhf() GeLU. If I use the 16-bit lookup-table approximation of the GeLU approximation that llama.cpp currently uses, then it does go a little faster: 9.2 seconds. The 128 KB LUT actually goes a lot slower in practice than benchmarks would lead you to believe, I think possibly because of how it demolishes the L1 cache. The vectorized code helps quite a bit with performance on my machines, and it gives you near-full accuracy for free.
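To make the cache argument concrete, here is a sketch of how a 16-bit lookup-table activation generally works: the input is rounded to fp16 and its 16-bit pattern indexes a 65,536-entry table. ggml's real table stores fp16 outputs (roughly 128 KB); the names, layout, and storage type below are illustrative, and the code assumes a compiler with _Float16 support:

```c
// Illustrative 16-bit lookup-table GeLU: round the input to fp16, then use the
// bit pattern as a table index. Assumes _Float16 support (recent GCC/Clang).
// Table name, layout, and storage type are illustrative, not ggml's actual code.
#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static float gelu_table[1 << 16];  // stored as f32 here; a 2-byte fp16 table would be ~128 KB

static float gelu_ref(float x) {
    return 0.5f * x * (1.0f + tanhf(0.79788456f * x * (1.0f + 0.044715f * x * x)));
}

static void gelu_table_init(void) {
    for (uint32_t i = 0; i < (1u << 16); i++) {
        uint16_t bits = (uint16_t) i;
        _Float16 h;
        memcpy(&h, &bits, sizeof(h));       // reinterpret the 16-bit pattern as an fp16 value
        gelu_table[i] = gelu_ref((float) h);
    }
}

static float gelu_lut(float x) {
    _Float16 h = (_Float16) x;              // rounding to fp16 is the source of the 16-bit accuracy limit
    uint16_t bits;
    memcpy(&bits, &h, sizeof(bits));
    return gelu_table[bits];
}

int main(void) {
    gelu_table_init();
    printf("lut: %f  exact tanh form: %f\n", gelu_lut(1.5f), gelu_ref(1.5f));
    return 0;
}
```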

ggerganov (Member) commented:
Here are some end-to-end CPU results on my hardware with llama.cpp and whisper.cpp:

Gemma-2B

GGML_NO_METAL=1 ./scripts/compare-commits.sh master pr/8878 \
    -m models/gemma-2b/ggml-model-q4_0.gguf -r 10 -p 0 -t 4,8,12,16
| CPU | Model | Threads | Test | t/s master | t/s pr/8878 | Speedup |
| --- | --- | --- | --- | --- | --- | --- |
| M2 Ultra | gemma 2B Q4_0 | 4 | tg128 | 47.89 | 47.81 | 1.00 |
| M2 Ultra | gemma 2B Q4_0 | 8 | tg128 | 71.46 | 71.09 | 0.99 |
| M2 Ultra | gemma 2B Q4_0 | 12 | tg128 | 83.52 | 82.41 | 0.99 |
| M2 Ultra | gemma 2B Q4_0 | 16 | tg128 | 90.55 | 90.10 | 1.00 |
./scripts/compare-commits.sh master pr/8878 \
    -m models/gemma-2b/ggml-model-q4_0.gguf -r 10 -p 0 -t 4,8,12,16
| CPU | Model | Threads | Test | t/s master | t/s pr/8878 | Speedup |
| --- | --- | --- | --- | --- | --- | --- |
| AMD Ryzen 9 5950X 16-Core | gemma 2B Q4_0 | 4 | tg128 | 17.27 | 17.35 | 1.00 |
| AMD Ryzen 9 5950X 16-Core | gemma 2B Q4_0 | 8 | tg128 | 18.28 | 18.50 | 1.01 |
| AMD Ryzen 9 5950X 16-Core | gemma 2B Q4_0 | 12 | tg128 | 18.38 | 18.54 | 1.01 |
| AMD Ryzen 9 5950X 16-Core | gemma 2B Q4_0 | 16 | tg128 | 18.33 | 18.40 | 1.00 |

Whisper

GGML_NO_METAL=1 make -j && ./scripts/bench-all.sh 8
  • master
| CPU | Config | Model | Th | FA | Enc. | Dec. | Bch5 | PP | Commit |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| M2 Ultra | NEON BLAS | tiny | 8 | 0 | 80.80 | 1.06 | 0.35 | 0.11 | fe36c90 |
| M2 Ultra | NEON BLAS | tiny-q5_0 | 8 | 0 | 80.81 | 0.76 | 0.34 | 0.11 | fe36c90 |
| M2 Ultra | NEON BLAS | tiny-q5_1 | 8 | 0 | 83.11 | 0.80 | 0.38 | 0.11 | fe36c90 |
| M2 Ultra | NEON BLAS | base | 8 | 0 | 159.92 | 1.83 | 0.65 | 0.21 | fe36c90 |
| M2 Ultra | NEON BLAS | base-q5_0 | 8 | 0 | 160.57 | 1.32 | 0.63 | 0.21 | fe36c90 |
| M2 Ultra | NEON BLAS | base-q5_1 | 8 | 0 | 150.90 | 1.37 | 0.67 | 0.21 | fe36c90 |
| M2 Ultra | NEON BLAS | small | 8 | 0 | 466.81 | 4.69 | 1.73 | 0.56 | fe36c90 |
| M2 Ultra | NEON BLAS | small-q5_0 | 8 | 0 | 462.21 | 3.29 | 1.63 | 0.55 | fe36c90 |
| M2 Ultra | NEON BLAS | small-q5_1 | 8 | 0 | 469.31 | 3.41 | 1.77 | 0.56 | fe36c90 |
  • cherry-pick this PR in whisper.cpp
| CPU | Config | Model | Th | FA | Enc. | Dec. | Bch5 | PP | Commit |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| M2 Ultra | NEON BLAS | tiny | 8 | 0 | 80.47 | 1.07 | 0.35 | 0.11 | fe36c90 |
| M2 Ultra | NEON BLAS | tiny-q5_0 | 8 | 0 | 79.52 | 0.77 | 0.35 | 0.11 | fe36c90 |
| M2 Ultra | NEON BLAS | tiny-q5_1 | 8 | 0 | 76.88 | 0.76 | 0.36 | 0.11 | fe36c90 |
| M2 Ultra | NEON BLAS | base | 8 | 0 | 157.43 | 1.81 | 0.64 | 0.21 | fe36c90 |
| M2 Ultra | NEON BLAS | base-q5_0 | 8 | 0 | 157.88 | 1.29 | 0.62 | 0.21 | fe36c90 |
| M2 Ultra | NEON BLAS | base-q5_1 | 8 | 0 | 154.50 | 1.35 | 0.67 | 0.21 | fe36c90 |
| M2 Ultra | NEON BLAS | small | 8 | 0 | 461.85 | 4.63 | 1.72 | 0.57 | fe36c90 |
| M2 Ultra | NEON BLAS | small-q5_0 | 8 | 0 | 450.27 | 3.25 | 1.62 | 0.56 | fe36c90 |
| M2 Ultra | NEON BLAS | small-q5_1 | 8 | 0 | 454.20 | 3.46 | 1.76 | 0.56 | fe36c90 |

(These are times in ms; lower is better. I would have expected to see a slight gain in Dec. and Bch5.)

Overall, I don't see much difference in the end-to-end performance. I do observe the speed improvement of the individual GELU op as measured by test-backend-ops perf, but it seems to be negligible relative to the overall model evaluation. @jart can you run the Gemma-2B comparison on your hardware and report the numbers?

> For tiny models it can be as high as 17% faster overall.

For measuring the performance of whisper.cpp I recommend using the bench tool, or the scripts/bench-all.sh script as I did above. Measuring the total transcription time of auto-regressive models is meaningless. Apart from whisper.cpp, I have never seen anyone else report performance results for Whisper correctly.

Comment on lines +2388 to +2385
if (!vpaddd_u64(vreinterpretq_u64_u32(special)))
return result;
return (float32x4_t){ special[0] ? tanhf(x[0]) : result[0],
special[1] ? tanhf(x[1]) : result[1],
special[2] ? tanhf(x[2]) : result[2],
special[3] ? tanhf(x[3]) : result[3] };
A Member commented:

Should we handle the special case as early as possible?

The AVX implementations don't handle the special case - would this be a problem?

jart (Contributor, Author) replied:

The if() branch is more likely to be taken, so putting it first helps the compiler optimize.

The x86 code doesn't need the fallback. I tested every float to be sure. Here's the accuracy of the arm approximation:

           110850649x            1  ulp errors
            17787389x            2  ulp errors
               67537x            3  ulp errors
                 124x            4  ulp errors

Here's the accuracy of avx512, which doesn't have any fallback to libc tanhf().

           108638843x            1  ulp errors
            18129735x            2  ulp errors
              143921x            3  ulp errors
                 124x            4  ulp errors

So it's only a teensy tiny bit worse. That's well within my tolerances and it gets rid of a slowdown.

The arm code however can't survive without the libc fallback. If we remove it, we hit ugly cases 0.5% of the time.

   110850649x            1  ulp errors
    17787389x            2  ulp errors
       67537x            3  ulp errors
         124x            4  ulp errors
    16777215x          31+  ulp errors

Someone here who has time might want to take a closer look into why my x86 version of arm's algorithm is doing much better. I'm still relatively new to the arm architecture.
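
For anyone who wants to reproduce these kinds of counts, here is a minimal sketch of an exhaustive ulp scan over every finite float. tanhf_approx() is a placeholder for a scalar re-implementation of one SIMD lane; this is not the actual llamafile test harness:

```c
// Exhaustive ulp-error scan against libm tanhf(). Takes a few minutes to run.
// tanhf_approx() is a placeholder; substitute the approximation under test.
#include <inttypes.h>
#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static float tanhf_approx(float x) {
    return tanhf(x);  // placeholder
}

// Map a float to an ordered integer so that differences count ulps.
static int64_t float_to_ordered(float f) {
    int32_t i;
    memcpy(&i, &f, sizeof(i));
    return i < 0 ? (int64_t) INT32_MIN - i : i;
}

int main(void) {
    uint64_t histogram[64] = {0};  // bucket 63 collects everything >= 63 ulps
    for (uint64_t bits = 0; bits <= UINT32_MAX; bits++) {
        float x;
        uint32_t b = (uint32_t) bits;
        memcpy(&x, &b, sizeof(x));
        if (isnan(x)) continue;
        int64_t d = float_to_ordered(tanhf(x)) - float_to_ordered(tanhf_approx(x));
        uint64_t ulps = (uint64_t) (d < 0 ? -d : d);
        histogram[ulps < 63 ? ulps : 63]++;
    }
    for (int i = 1; i < 64; i++) {
        if (histogram[i]) {
            printf("%12" PRIu64 "x  %s%d ulp errors\n", histogram[i], i == 63 ? ">= " : "", i);
        }
    }
    return 0;
}
```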

mofosyne added the Review Complexity : High label (generally requires in-depth knowledge of LLMs or GPUs) on Aug 6, 2024
This change makes GeLU more accurate on amd64 and arm64 by using a tanhf
approximation that's been explicitly vectorized for avx512f, avx2, sse2,
and neon. No performance is traded away on these architectures, compared
to the 16-bit lookup table that was being used previously. The impact of
this change can be demonstrated easily with whisper, where it leads to a
measurable improvement in Levenshtein distance of model output.
jart (Contributor, Author) commented Aug 18, 2024

If you're measuring the big picture performance impact of this change, you're only guaranteed to get a noticeable speedup if you do an apples-to-apples comparison of this vectorized tanhf() with the libc version. Your FP16 LUT already produces good output and it's very fast. We know from your backend test suite microbenchmarks that the vectorized version is faster though. There are many possible reasons why this performance improvement wouldn't manifest itself in your overall tokens per second. For example, it could be swallowed up by the thread synchronization gaps between ops caused by a suboptimal barrier implementation, which I'm planning to improve for you soon.

The most persuasive selling point for this change is its impact on output quality. I've measured the impact this change has on the whisper model's output when transcribing Edgar Allan Poe recordings. When using your tiny.en.q5_1 weights (31mb) on The Raven (from archive.org), the Levenshtein distance of whisper's output, compared against the Project Gutenberg text, improves from 0.853006 to 0.857595. That's 0.004589 better. Now compare this with the output quality of the medium.en (1.5gb) weights, which produce output with a Levenshtein distance of 0.942247. Therefore, my vectorized GeLU function gives you about 5% of the quality improvement you'd expect if you were to upgrade from the tiny to the medium model. So that 0.004589 delta matters a lot for a 1072-word poem.
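For reference, here is a minimal sketch of the kind of edit-distance scoring described above, assuming the reported figures are a normalized similarity of the form 1 - levenshtein(hyp, ref) / max(|hyp|, |ref|). The exact tokenization (character vs word level) isn't stated, so this character-level version is only illustrative:

```c
// Sketch of a normalized edit-similarity score, assuming the reported figures
// are 1 - levenshtein(hyp, ref) / max(|hyp|, |ref|). Character-level and
// illustrative only; not the actual measurement script.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static size_t levenshtein(const char *a, const char *b) {
    size_t n = strlen(a), m = strlen(b);
    size_t *row = malloc((m + 1) * sizeof *row);
    for (size_t j = 0; j <= m; j++) row[j] = j;
    for (size_t i = 1; i <= n; i++) {
        size_t prev = row[0];              // D[i-1][j-1]
        row[0] = i;
        for (size_t j = 1; j <= m; j++) {
            size_t cur  = row[j];          // D[i-1][j]
            size_t cost = (a[i-1] == b[j-1]) ? 0 : 1;
            size_t best = prev + cost;                     // substitution / match
            if (row[j] + 1 < best)   best = row[j] + 1;    // deletion
            if (row[j-1] + 1 < best) best = row[j-1] + 1;  // insertion
            row[j] = best;
            prev = cur;
        }
    }
    size_t d = row[m];
    free(row);
    return d;
}

static double similarity(const char *hyp, const char *ref) {
    size_t n = strlen(hyp), m = strlen(ref);
    size_t len = n > m ? n : m;
    return len ? 1.0 - (double) levenshtein(hyp, ref) / (double) len : 1.0;
}

int main(void) {
    printf("%f\n", similarity("quoth the ravn", "quoth the raven"));
    return 0;
}
```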

I've force-pushed the latest iteration of my ggml_vec_gelu_f32() function, which now has smarter edge-case handling. The tests I wrote for it are in llamafile's vmathf_test.cpp file. This more consistent edge-case handling has improved my confidence in this implementation to the point where I now recommend deleting the old FP16 LUT code, just like we did before when I optimized SiLU.

PTAL.
