ggml : make GeLU faster and more accurate on CPU #8878
base: master
Conversation
Test failure appears to be unrelated; it looks like something to do with JSON.
What's the end-to-end speedup?
According to the log, the test failure is caused by …
Does llamafile 0.8.12 use this already?
Regarding that failure, I've rolled back the bf16 changes to the LUT just to play it safe.
No, you have to build it at HEAD. A release will be coming out shortly. See below for build instructions.
For tiny models it can be as high as 17% faster overall. Take for example whisper.cpp using ggml-tiny-q5_1.bin. I can build that with llamafile for the purposes of turning an Edgar Allan Poe wav file into text as follows:
And it takes 7.9 seconds. But if I pass the …
And it takes 9.5 seconds. That's for full-quality libm tanhf() GeLU. If I use the 16-bit approximation of the GeLU approximation that llama.cpp is currently using, then it does go a little faster: 9.2 seconds. The 128 KB LUT actually goes a lot slower in practice than benchmarks would lead you to believe, I think possibly due to how it demolishes the L1 cache. The vectorized code helps quite a bit with performance on my machines. Plus it gives you near-full accuracy for free.
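For reference, this is the standard tanh-based GeLU formula that both the FP16 LUT and the vectorized code are approximating. A scalar sketch for clarity only, not the PR's SIMD implementation:

#include <math.h>

// GeLU via the widely used tanh approximation:
// 0.5*x*(1 + tanh(sqrt(2/pi)*(x + 0.044715*x^3)))
static inline float gelu_ref(float x) {
    const float sqrt_2_over_pi = 0.79788456080286535588f;
    const float coef_a         = 0.044715f;
    return 0.5f*x*(1.0f + tanhf(sqrt_2_over_pi*x*(1.0f + coef_a*x*x)));
}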
Here are some end-to-end CPU results on my hardware with Gemma-2B:

GGML_NO_METAL=1 ./scripts/compare-commits.sh master pr/8878 \
  -m models/gemma-2b/ggml-model-q4_0.gguf -r 10 -p 0 -t 4,8,12,16
Whisper:

GGML_NO_METAL=1 make -j && ./scripts/bench-all.sh 8
Overall, I don't see much difference in the end-to-end performance. I do observe the speed improvement of the individual GELU op measured by the backend test suite microbenchmarks.

For measuring the performance of …
if (!vpaddd_u64(vreinterpretq_u64_u32(special)))
    return result;
return (float32x4_t){ special[0] ? tanhf(x[0]) : result[0],
                      special[1] ? tanhf(x[1]) : result[1],
                      special[2] ? tanhf(x[2]) : result[2],
                      special[3] ? tanhf(x[3]) : result[3] };
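(For context, here's a rough, hypothetical sketch of how an excerpt like the one above might sit inside a complete NEON helper: a lane mask flags inputs where the vectorized approximation is presumed untrustworthy, and only those lanes fall back to libm tanhf(). The function name and the threshold are illustrative assumptions, not the PR's actual code.)

#include <arm_neon.h>
#include <math.h>

static inline float32x4_t vtanhf_with_fallback(float32x4_t x, float32x4_t result) {
    // Flag lanes whose magnitude exceeds an assumed cutoff beyond which the
    // approximation is treated as unreliable (the 9.0f threshold is illustrative).
    uint32x4_t special = vcagtq_f32(x, vdupq_n_f32(9.0f));
    // Fast path: no special lanes, so keep the vectorized result as-is.
    if (!vpaddd_u64(vreinterpretq_u64_u32(special)))
        return result;
    // Slow path: recompute only the flagged lanes with libm tanhf().
    return (float32x4_t){ special[0] ? tanhf(x[0]) : result[0],
                          special[1] ? tanhf(x[1]) : result[1],
                          special[2] ? tanhf(x[2]) : result[2],
                          special[3] ? tanhf(x[3]) : result[3] };
}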
Should we handle the special case as early as possible?
The AVX implementations don't handle the special case - would this be a problem?
The if() branch is the one more likely to be taken, so putting it first helps the compiler optimize.
The x86 code doesn't need the fallback. I tested every float to be sure. Here's the accuracy of the arm approximation:
110850649x 1 ulp errors
17787389x 2 ulp errors
67537x 3 ulp errors
124x 4 ulp errors
Here's the accuracy of avx512, which doesn't have any fallback to libc tanhf().
108638843x 1 ulp errors
18129735x 2 ulp errors
143921x 3 ulp errors
124x 4 ulp errors
So it's only a teensy tiny bit worse. That's well within my tolerances and it gets rid of a slowdown.
The arm code, however, can't survive without the libc fallback. If we remove it, we hit ugly cases 0.5% of the time.
110850649x 1 ulp errors
17787389x 2 ulp errors
67537x 3 ulp errors
124x 4 ulp errors
16777215x 31+ ulp errors
Someone here who has time might want to take a closer look into why my x86 version of arm's algorithm is doing much better. I'm still relatively new to the arm architecture.
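(For anyone who wants to reproduce histograms like the ones above, here's a sketch of the methodology as I understand it, not the author's actual test code: sweep every finite float, compare the approximation against libm tanhf(), and bucket by ULP distance. tanhf_approx() is a hypothetical stand-in for whichever implementation you're measuring.)

#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical stand-in: replace with the approximation under test.
static float tanhf_approx(float x) { return tanhf(x); }

// Map float bit patterns onto a monotonic integer scale so that ULP distance
// becomes a simple integer difference.
static int32_t float_to_ordered(float f) {
    int32_t i;
    memcpy(&i, &f, sizeof(i));
    return i < 0 ? (int32_t)(INT32_MIN - i) : i;
}

static int64_t ulp_distance(float a, float b) {
    return llabs((int64_t)float_to_ordered(a) - (int64_t)float_to_ordered(b));
}

int main(void) {
    long long histogram[33] = {0};   // histogram[k] = inputs with k ULPs of error (capped at 32)
    for (uint64_t u = 0; u <= 0xFFFFFFFFull; u++) {
        uint32_t bits = (uint32_t)u;
        float x;
        memcpy(&x, &bits, sizeof(x));
        if (!isfinite(x)) continue;  // skip NaNs and infinities
        int64_t err = ulp_distance(tanhf_approx(x), tanhf(x));
        histogram[err > 32 ? 32 : err]++;
    }
    for (int k = 1; k <= 32; k++)
        if (histogram[k])
            printf("%lldx %d%s ulp errors\n", histogram[k], k, k == 32 ? "+" : "");
    return 0;
}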
This change makes GeLU more accurate on amd64 and arm64 by using a tanhf approximation that's been explicitly vectorized for avx512f, avx2, sse2, and neon. No performance is traded away on these architectures, compared to the 16-bit lookup table that was being used previously. The impact of this change can be demonstrated easily with whisper, where it leads to a measurable improvement in the Levenshtein distance of model output.
If you're measuring the big-picture performance impact of this change, you're only guaranteed to get a noticeable speedup if you do an apples-to-apples comparison of this vectorized tanhf() with the libc version. Your FP16 LUT already produces good output and it's very fast. We know from your backend test suite microbenchmarks that the vectorized version is faster, though. There are many possible reasons why this improvement wouldn't show up in your overall tokens per second. For example, it could be swallowed up by the thread synchronization gaps between ops caused by a suboptimal barrier implementation, which I'm planning to improve for you soon.

The most persuasive selling point for this change is its impact on output quality. I've measured the impact this change has on the whisper model's output when transcribing Edgar Allan Poe recordings. Using your tiny.en.q5_1 weights (31 MB) on The Raven (from archive.org), the Levenshtein distance of whisper's output, compared against the Project Gutenberg text, improves from 0.853006 to 0.857595. That's 0.004589 better. Now compare this with the output quality of the medium.en (1.5 GB) weights, which produce output with a Levenshtein distance of 0.942247. Therefore, my vectorized GeLU function gives you 5% of the quality gain you'd expect if you were to upgrade from the tiny to the medium model. So that 0.004589 delta matters a lot for a 1072-word poem.

I've force-pushed the latest iteration of my ggml_vec_gelu_f32() function, which now has smarter edge-case handling. The tests I wrote for it are in llamafile's vmathf_test.cpp file. This more consistent edge-case handling has improved my confidence in the implementation to the point where I now recommend deleting the old FP16 LUT code, just like we did before when I optimized SiLU. PTAL.
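(The quality numbers above are on a 0-to-1 scale where higher is better, so I read them as a normalized Levenshtein similarity rather than a raw edit distance. Here's a minimal sketch of that metric under that assumption; it's illustrative, not the actual scoring code used for these measurements.)

#include <stddef.h>
#include <stdlib.h>
#include <string.h>

// Classic single-row dynamic-programming Levenshtein edit distance over bytes.
static size_t levenshtein(const char *a, const char *b) {
    size_t la = strlen(a), lb = strlen(b);
    size_t *row = malloc((lb + 1) * sizeof *row);
    for (size_t j = 0; j <= lb; j++) row[j] = j;
    for (size_t i = 1; i <= la; i++) {
        size_t prev = row[0];                        // dp[i-1][j-1]
        row[0] = i;
        for (size_t j = 1; j <= lb; j++) {
            size_t del = row[j] + 1;                 // delete:     dp[i-1][j] + 1
            size_t ins = row[j-1] + 1;               // insert:     dp[i][j-1] + 1
            size_t sub = prev + (a[i-1] != b[j-1]);  // substitute: dp[i-1][j-1] + cost
            prev = row[j];
            row[j] = del < ins ? (del < sub ? del : sub) : (ins < sub ? ins : sub);
        }
    }
    size_t d = row[lb];
    free(row);
    return d;
}

// Normalized similarity in [0,1]: 1 - distance / max(len). This normalization
// is my assumption about how the quoted scores were produced.
static double levenshtein_similarity(const char *hyp, const char *ref) {
    size_t lh = strlen(hyp), lr = strlen(ref);
    size_t lmax = lh > lr ? lh : lr;
    return lmax ? 1.0 - (double)levenshtein(hyp, ref) / (double)lmax : 1.0;
}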
This change makes GeLU go 8x faster on Intel, 3x faster on Apple Silicon, and 2x faster on Threadripper. It's the world's most popular activation function, crucial to models such as Whisper and Gemma. On those models, this change can yield a noticeable improvement in performance, because GeLU is usually the most time-consuming op after matrix multiplication.
In addition to improving performance, this change also improves accuracy. On ARM64 and AMD64 systems, we no longer need to rely on a 16-bit lookup table; we're now using SIMD instead. The GeLU lookup table is still here, except it's been converted from fp16 to bf16. That possibly aligns inference a little more closely with training, but mainly it lets us avoid the two extra lookups into the fp16 table. Therefore this change should have a positive impact on performance for platforms like OpenPOWER and RISC-V too.
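To make the bf16 point concrete: because bf16 is simply the top 16 bits of an fp32, the table index can come straight from a bit shift, with no fp32-to-fp16 conversion (and no conversion-table lookups) on the way in. The sketch below illustrates that indexing scheme under my own assumptions; the table name is hypothetical and, for simplicity, it stores fp32 results rather than mirroring ggml's actual table layout:

#include <math.h>
#include <stdint.h>
#include <string.h>

static float gelu_table_bf16[1 << 16];  // hypothetical: one entry per bf16 value

static void gelu_table_bf16_init(void) {
    for (uint32_t i = 0; i < (1u << 16); i++) {
        uint32_t bits = i << 16;         // bf16 -> fp32: pad the low mantissa bits with zeros
        float x;
        memcpy(&x, &bits, sizeof(x));
        gelu_table_bf16[i] = 0.5f*x*(1.0f + tanhf(0.79788456080286535588f*x*(1.0f + 0.044715f*x*x)));
    }
}

static inline float gelu_lut_bf16(float x) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof(bits));
    return gelu_table_bf16[bits >> 16];  // the high 16 bits of x are its (truncated) bf16 encoding
}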
Due to the sensitive nature of activation functions, I encourage you all to evaluate this change's impact on model output before merging. Vectorizing GeLU required trading away a few ULPs of worst-case accuracy compared to libm. LLMs normally have limitless tolerance for errors, but due to the nature of tanhf() this is a case where even off-by-one errors can cause user-visible changes in model output. It is my belief, based on my own experiments so far, that this code works well for llama.cpp, whisper.cpp, gemma, etc.
This software was developed by Mozilla Ocho and ARM Limited. It first appeared in llamafile which offers you llama.cpp / whisper.cpp / stable-diffusion.cpp with the most bleeding edge performance optimizations and binary distributability.
Benchmarks
Intel(R) Core(TM) i9-14900K (AVX2)
Performance is improved by 8x.
Accuracy is improved from 16 to 29 bits.
Yes, GeLU has superior performance even on the cheapest microprocessor.
Mac Studio M2 Ultra (ARM64)
Performance is improved by 3.1x.
Accuracy is improved from 16 to 29 bits.
AMD Ryzen Threadripper PRO 7995WX (AVX512)
Performance is improved by 2.4x.
Accuracy is improved from 16 to 29 bits.