ggml : add AVX support based on AVX2 code #1430


Merged: 4 commits merged into master on May 14, 2023
Conversation

katsu560 (Contributor)

I reopened this as a new PR on the latest master.
I added AVX support, based on the AVX2 code, to the functions below (an illustrative sketch of the approach follows the list):
static inline __m256i bytes_from_bits_32(const uint8_t * x)
static inline __m256i bytes_from_nibbles_32(const uint8_t * rsi)
static inline __m256 sum_i16_pairs_float(const __m128i xh, const __m128i xl)
static inline __m256 mul_sum_i8_pairs_float(const __m256i x, const __m256i y)
static void ggml_vec_dot_q4_1_q8_1(const int n, float * restrict s, const void * restrict vx, const void * restrict vy)
static void ggml_vec_dot_q5_0_q8_0(const int n, float * restrict s, const void * restrict vx, const void * restrict vy)
static void ggml_vec_dot_q5_1_q8_1(const int n, float * restrict s, const void * restrict vx, const void * restrict vy)
static void ggml_vec_dot_q8_0_q8_0(const int n, float * restrict s, const void * restrict vx, const void * restrict vy)
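
For context, here is a minimal sketch of what such an AVX fallback can look like (illustrative only, assuming the usual half-splitting approach; not necessarily the exact code in this PR). AVX1 lacks most 256-bit integer instructions, so the 256-bit AVX2 operation is emulated on two 128-bit SSE halves and recombined:

#include <immintrin.h>
#include <stdint.h>

// Illustrative AVX1 variant of bytes_from_nibbles_32: expand 16 packed
// 4-bit nibbles (16 bytes) into 32 bytes. AVX1 has no 256-bit integer
// shift/AND, so the work is done on two 128-bit halves and recombined.
// Note: _mm256_set_m128i is missing on some older compilers, which is
// why ggml wraps it in a MM256_SET_M128I macro.
static inline __m256i bytes_from_nibbles_32(const uint8_t * rsi)
{
    __m128i tmpl = _mm_loadu_si128((const __m128i *) rsi); // 16 bytes = 32 nibbles
    __m128i tmph = _mm_srli_epi16(tmpl, 4);                // move high nibbles down
    const __m128i low_mask = _mm_set1_epi8(0x0F);
    tmpl = _mm_and_si128(low_mask, tmpl);                  // isolate low nibbles
    tmph = _mm_and_si128(low_mask, tmph);                  // mask off shifted-in bits
    return _mm256_set_m128i(tmph, tmpl);                   // combine into 256 bits
}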

Performance improved:
with q5_0 model: 1084790.97 ms -> 156907.19 ms (~6.9x faster)
with q5_1 model: 1072326.98 ms -> 301364.32 ms (~3.6x faster)

@ggerganov and @sw, please review this PR.


@sw (Contributor) left a comment


@katsu560 Thank you, this is working fine.

However, for ggml_vec_dot_q4_1_q8_1 and ggml_vec_dot_q8_0_q8_0, only a single line differs between the AVX and AVX2 paths; I think those should be merged.

Also, please learn to use Git and PRs, and don't open a new PR for every modification on the same topic.

@prusnak (Collaborator) commented May 13, 2023

> only a single line differs between the AVX and AVX2 paths; I think those should be merged.

It seems like you should be able to use the following trick if only one line differs:

#elif defined(__AVX__) || defined(__AVX2__)
//
// common code for both AVX1 and AVX2
//
#if defined(__AVX__)
//
// code specific for AVX1
//
#else // defined(__AVX2__)
//
// code specific for AVX2
//
#endif
//
// common code for both AVX1 and AVX2
//
#elif ...
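
Applied to this PR, the trick could look like this (an illustrative sketch with a simplified signature, not the exact PR code): everything is shared except the multiply-add, which AVX1 has to do on two 128-bit halves because _mm256_madd_epi16 requires AVX2.

static inline __m256 sum_i16_pairs_float(const __m256i x)
{
#if defined(__AVX2__)
    // AVX2: multiply-add adjacent i16 pairs in one 256-bit instruction
    const __m256i summed_pairs = _mm256_madd_epi16(_mm256_set1_epi16(1), x);
#else // defined(__AVX__)
    // AVX1: no 256-bit integer madd, so process the two 128-bit halves
    const __m128i ones = _mm_set1_epi16(1);
    const __m128i lo   = _mm_madd_epi16(ones, _mm256_castsi256_si128(x));
    const __m128i hi   = _mm_madd_epi16(ones, _mm256_extractf128_si256(x, 1));
    const __m256i summed_pairs = _mm256_set_m128i(hi, lo);
#endif
    // common code for both paths: convert the i32 sums to float
    return _mm256_cvtepi32_ps(summed_pairs);
}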

@sw (Contributor) commented May 13, 2023

I don't know if it's really worthwhile to add SIMD optimizations for the quantization of the Q4 and Q5 formats; these were removed with #1405. I think we'd want optimized dot-product routines, and the Q8 quantizations.

@ggerganov (Member)

> I don't know if it's really worthwhile to add SIMD optimizations for the quantization of the Q4 and Q5 formats

Yes, let's not complicate the implementation with Q4 and Q5 SIMD quantize / dequantize for now.
We can SIMD-ify these at a later stage, once we are confident about which quantization formats will remain.

@katsu560 (Contributor, Author)

I pushed code that merges the AVX2 and AVX paths.
Please review it and merge.

Regarding adding SIMD optimizations for the quantization of the Q4 and Q5 formats: I understand that you are withholding those optimizations for now.

katsu560 requested a review from sw, May 14, 2023 09:38