AVX2 optimization for vec_dot_q4_3_q8_0 and refactoring #1099


Merged

ggerganov merged 3 commits into ggml-org:master from sw:q43-avx on Apr 22, 2023

Conversation

sw (Contributor) commented Apr 21, 2023

Apart from adding the AVX2 optimization for Q4_3, this refactors some commonly used intrinsic sequences into inline functions.
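
For illustration, a minimal sketch of the kind of sequence being factored out: unpacking 32 packed 4-bit values ("nibbles") into a 256-bit register of bytes, which every Q4_x dot product needs. The helper name and exact shuffle here are illustrative, not necessarily the code in this PR:

```c
#include <stdint.h>
#include <immintrin.h>

// Expand 16 bytes holding 32 packed nibbles into 32 bytes, one nibble each.
// Requires AVX2 (_mm256_and_si256); the helper name is illustrative.
static inline __m256i bytes_from_nibbles_32(const uint8_t * p) {
    const __m128i tmp   = _mm_loadu_si128((const __m128i *)p);
    // low 128-bit half: low nibbles; high half: high nibbles shifted down
    const __m256i bytes = _mm256_set_m128i(_mm_srli_epi16(tmp, 4), tmp);
    return _mm256_and_si256(_mm256_set1_epi8(0xF), bytes);
}
```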

slaren (Member) commented Apr 21, 2023

q4_3
42.94 seconds per pass - ETA 7.81 hours
prompt eval time = 54411.09 ms /   631 tokens (   86.23 ms per token) bs=512
prompt eval time = 59126.51 ms /   631 tokens (   93.70 ms per token) bs=8
       eval time = 41742.69 ms /   255 runs   (  163.70 ms per run)

q4_1
35.42 seconds per pass - ETA 6.45 hours
prompt eval time = 52762.72 ms /   631 tokens (   83.62 ms per token) bs=512
prompt eval time = 56287.43 ms /   631 tokens (   89.20 ms per token) bs=8
       eval time = 41024.48 ms /   255 runs   (  160.88 ms per run)

Except for the perplexity timing, the performance looks good compared to q4_1; not sure why there is a discrepancy there.

ggerganov (Member) commented Apr 21, 2023

Before merging this: the current Q4_3 format / implementation is not very efficient with ARM NEON:

Time per token on M1 Pro:

  • Q4_0 : 48ms
  • Q4_1 : 55ms
  • Q4_2 : 48ms
  • Q4_3 : 87ms

I want to make it close to ~50-60 ms / token.
But I think we might have to change the format if the optimization from #1083 does not work out.

I will try to optimize this with the highest priority, so we can decide on the final Q4_3 format.

sw (Contributor, Author) commented Apr 21, 2023

Well, #1083 was a bit rushed IMO, but I tried to address the loose ends.

For the horizontal sum of ints, I could not see a difference in speed between @ikawrakow's original code and @pubby's suggestion, which ended up as commented-out code. The latter is AVX2-only, while the original should also work on plain AVX.
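
For reference, a sketch of the two flavors of horizontal int sum being compared (not necessarily the exact code in the PR): the first sticks to AVX/SSE instructions, the second relies on AVX2 horizontal adds.

```c
#include <immintrin.h>

// Horizontal sum of 8 x int32 using only AVX/SSE instructions
static inline int hsum_i32_8_avx(const __m256i a) {
    const __m128i lo = _mm256_castsi256_si128(a);
    const __m128i hi = _mm256_extractf128_si256(a, 1);
    __m128i s = _mm_add_epi32(lo, hi);                  // 4 partial sums
    s = _mm_add_epi32(s, _mm_unpackhi_epi64(s, s));     // 2 partial sums
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_cvtsi128_si32(s);
}

// AVX2-only alternative built on 256-bit horizontal adds
static inline int hsum_i32_8_avx2(const __m256i a) {
    __m256i t = _mm256_hadd_epi32(a, a);   // pairwise sums within each 128-bit lane
    t = _mm256_hadd_epi32(t, t);           // each lane now holds its own total
    return _mm256_extract_epi32(t, 0) + _mm256_extract_epi32(t, 4);
}
```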

sw marked this pull request as draft on April 21, 2023 16:14
sw (Contributor, Author) commented Apr 21, 2023

Finally, I don't think there is a speed difference in the horizontal sums. I have now finished the AVX optimization for quantize_row_q8_0, but I'm not sure I can trust the compiler to catch it if I accidentally use an AVX2 intrinsic. It would be great if someone with an AVX-only machine could test this.
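
Not the PR's code, just a sketch of one way to get that verification from the compiler itself: with GCC/Clang, an intrinsic is rejected at compile time unless its instruction set is enabled, so building the AVX-only configuration with `-mavx -mno-avx2` turns any stray AVX2 call into a hard error (MSVC does not gate intrinsics by /arch, so this check only helps with GCC/Clang).

```c
#include <immintrin.h>

// Illustrative only: the #if structure lets the AVX-only build be compiled
// with `-mavx -mno-avx2`, where any accidental AVX2 intrinsic fails to build.
void scale_floats_example(float * x, int n, float s) {
    int i = 0;
#if defined(__AVX__)
    const __m256 vs = _mm256_set1_ps(s);   // 256-bit float ops are plain AVX
    for (; i + 8 <= n; i += 8) {
        _mm256_storeu_ps(x + i, _mm256_mul_ps(_mm256_loadu_ps(x + i), vs));
    }
    // an AVX2 intrinsic such as _mm256_abs_epi32 here would not compile
    // without -mavx2, which is exactly the check being asked for
#endif
    for (; i < n; i++) {
        x[i] *= s;                          // scalar tail / non-AVX fallback
    }
}
```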

sw marked this pull request as ready for review on April 21, 2023 16:41
ggerganov (Member) left a comment


It seems the Q4_3 format will remain unchanged from what is on master, so let's merge this.
If the AVX-only path has issues, we will resolve them later.

After the merge, I will try to rebase #1109 and merge it as well.

ggerganov merged commit c5aa5e5 into ggml-org:master on Apr 22, 2023
sw deleted the q43-avx branch on April 22, 2023 07:56