Add initial AVX512 support for dot product on Linux #320

Merged: 4 commits merged into ggml-org:master on Mar 21, 2023

Conversation

Ameobea
Contributor

@Ameobea Ameobea commented Mar 20, 2023

NOTE: I am seeing different outputs when running with these changes. They seem to be of equal quality, but this isn't something I observed when first testing this out on alpaca.cpp.

It's possible that some rounding behavior is slightly different. If this is a dealbreaker, I can try to figure out what is causing the difference and whether it's possible to get rid of it.

Changes

  • Update the Makefile to detect AVX512 support and add compiler flags when it's available
  • Add an AVX512 implementation based on the existing AVX2 one, computing the dot product on one 32-value block of 4-bit quantized ints at a time
  • Perform the 8-bit -> 16-bit sign extension and multiply+add on 32 values at a time instead of 16
  • Use the built-in AVX512 horizontal reduce-add to get the sum at the end
  • Manually unroll the inner dot-product loop to reduce loop-counter overhead

Performance Impact

I initially implemented this over on alpaca.cpp, where I saw an ~10% speedup to inference.

Before:

main: mem per token = 14368644 bytes
main:     load time =   923.25 ms
main:   sample time =    85.94 ms
main:  predict time = 23502.37 ms / 92.17 ms per token
main:    total time = 24845.69 ms

After:

main: mem per token = 14368644 bytes
main:     load time =   928.89 ms
main:   sample time =    16.18 ms
main:  predict time =  5720.41 ms / 82.90 ms per token
main:    total time =  6982.89 ms

I was hoping for more, but some other things I tried, like converting the bytesFromNibbles function to operate on two blocks at a time using AVX512, were not successful.

@congdm

congdm commented Mar 21, 2023

Thank you for your effort! It also works on Windows and gives a little boost on my i7-11700F, from ~208 ms/token to 195 ms/token or sometimes even 185 ms/token on Alpaca7B.

@sw sw merged commit 2e664f1 into ggml-org:master Mar 21, 2023
mudler pushed a commit to go-skynet/llama that referenced this pull request Mar 21, 2023
Labels: enhancement (New feature or request), performance (Speed related topics)
4 participants