Add initial AVX512 support for dot product on Linux #320

Merged: 4 commits merged into ggml-org:master on Mar 21, 2023

Conversation

Ameobea
Contributor

@Ameobea Ameobea commented Mar 20, 2023

NOTE: I am seeing different outputs when running with these changes. They seem to be of equal quality, but this isn't something I observed when first testing this out on alpaca.cpp.

It's possible that some rounding behavior is slightly different. If this is a dealbreaker, I can try to figure out what is causing the difference and whether it's possible to get rid of it.

Changes

  • Update the Makefile to detect AVX512 support and add compiler flags when it's available
  • Add an AVX512 implementation based on the existing AVX2 one, computing the dot product on one 32-value block of 4-bit quantized ints at a time
  • Perform the 8-bit -> 16-bit sign extension and multiply+add on 32 values at a time instead of 16
  • Use the built-in AVX512 horizontal reduce-add to get the sum at the end
  • Manually unroll the inner dot-product loop to reduce loop-counter overhead

Performance Impact

I initially implemented this over on alpaca.cpp, where I saw an ~10% speedup to inference.

Before:

main: mem per token = 14368644 bytes
main:     load time =   923.25 ms
main:   sample time =    85.94 ms
main:  predict time = 23502.37 ms / 92.17 ms per token
main:    total time = 24845.69 ms

After:

main: mem per token = 14368644 bytes
main:     load time =   928.89 ms
main:   sample time =    16.18 ms
main:  predict time =  5720.41 ms / 82.90 ms per token
main:    total time =  6982.89 ms

I was hoping for more, but some other things I tried, like converting the bytesFromNibbles function to operate on two blocks at a time using AVX512, were not successful.

@congdm

congdm commented Mar 21, 2023

Thank you for your effort! It also works on Windows and gives a little boost on my i7-11700F, from ~208 ms/token to 195 ms/token or sometimes even 185 ms/token on Alpaca7B.

@sw sw merged commit 2e664f1 into ggml-org:master Mar 21, 2023
mudler pushed a commit to go-skynet/llama that referenced this pull request Mar 21, 2023
Labels: enhancement (New feature or request), performance (Speed related topics)
4 participants