AVX2 optimization for vec_dot_q4_3_q8_0 and refactoring #1099
Except for perplexity, the performance looks good compared to q4_1. I'm not sure why there is a discrepancy there.
Before merging this: the current Time per token on M1 Pro:
I want to make it close to ~50-60 ms / token. Will try to optimize this with highest priority, so we can decide on the final |
Well #1083 was a bit rushed IMO, but I tried to address the loose ends. For the horizontal sum of ints, I could not see a difference in speed between @ikawrakow's original code and @pubby's suggestion, which ended up as commented-out code. The latter is AVX2-only, while the original should also work on AVX.
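For readers unfamiliar with the term, here is a minimal sketch of a horizontal sum of eight 32-bit integers that only needs AVX-level instructions for the 256-bit part (the adds themselves are 128-bit). The function name `hsum_i32_8_sketch` and the array-based signature are assumptions made for illustration and are not taken from the PR; a scalar fallback is included so the snippet also builds where AVX is not enabled.

```c
#include <stdint.h>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Hypothetical helper (name and signature are illustrative, not from the PR):
// returns v[0] + v[1] + ... + v[7].
static inline int32_t hsum_i32_8_sketch(const int32_t v[8]) {
#if defined(__AVX__)
    const __m256i a = _mm256_loadu_si256((const __m256i *) v);
    // Fold the upper 128-bit lane onto the lower one. The extract is an AVX
    // instruction and the integer add is 128-bit, so no AVX2 is required.
    const __m128i sum128 = _mm_add_epi32(_mm256_castsi256_si128(a),
                                         _mm256_extractf128_si256(a, 1));
    // Fold the upper 64 bits onto the lower 64 bits.
    const __m128i hi64  = _mm_unpackhi_epi64(sum128, sum128);
    const __m128i sum64 = _mm_add_epi32(hi64, sum128);
    // Swap adjacent 32-bit elements and add, leaving the total in element 0.
    const __m128i hi32  = _mm_shuffle_epi32(sum64, _MM_SHUFFLE(2, 3, 0, 1));
    return _mm_cvtsi128_si32(_mm_add_epi32(sum64, hi32));
#else
    // Scalar fallback for builds without AVX.
    int32_t s = 0;
    for (int i = 0; i < 8; ++i) {
        s += v[i];
    }
    return s;
#endif
}
```

An AVX2-only variant would typically use 256-bit integer adds instead; as noted above, no speed difference was observed between the two approaches in this PR.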
Finally I don't think there is a speed difference in the horizontal sums. I have now finished the AVX optimization for |
The `Q4_3` format seems like it will remain unchanged as it is on `master`, so let's merge this.
If the AVX-only path has issues, we will resolve them later.
After the merge, I will try to rebase #1109 and merge it as well.
Apart from adding the AVX2 optimization for Q4_3, this refactors some commonly used intrinsic sequences into `inline` functions.
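As an illustration of this kind of refactoring, the sketch below pulls a repeated intrinsic sequence (reducing eight floats in a `__m256` to one scalar) into a single `inline` helper that two callers share. All names here (`hsum_float_8_sketch`, `dot_f32_sketch`, `sum_f32_sketch`) are assumptions made for the example and do not claim to match the functions in the PR; the scalar tail loops keep the code usable when AVX is not enabled.

```c
#include <stddef.h>
#if defined(__AVX__)
#include <immintrin.h>

// Hypothetical shared helper: horizontal sum of the eight floats in x.
static inline float hsum_float_8_sketch(const __m256 x) {
    __m128 res = _mm256_extractf128_ps(x, 1);          // upper four floats
    res = _mm_add_ps(res, _mm256_castps256_ps128(x));  // add lower four floats
    res = _mm_add_ps(res, _mm_movehl_ps(res, res));    // fold 4 -> 2
    res = _mm_add_ss(res, _mm_movehdup_ps(res));       // fold 2 -> 1
    return _mm_cvtss_f32(res);
}
#endif

// Dot product of a and b (n elements each); reuses the helper.
static float dot_f32_sketch(const float *a, const float *b, size_t n) {
    float sum = 0.0f;
    size_t i = 0;
#if defined(__AVX__)
    __m256 acc = _mm256_setzero_ps();
    for (; i + 8 <= n; i += 8) {
        acc = _mm256_add_ps(acc, _mm256_mul_ps(_mm256_loadu_ps(a + i),
                                               _mm256_loadu_ps(b + i)));
    }
    sum = hsum_float_8_sketch(acc);
#endif
    // Scalar tail (or the whole loop when AVX is unavailable).
    for (; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}

// Plain sum of a (n elements); reuses the same helper.
static float sum_f32_sketch(const float *a, size_t n) {
    float sum = 0.0f;
    size_t i = 0;
#if defined(__AVX__)
    __m256 acc = _mm256_setzero_ps();
    for (; i + 8 <= n; i += 8) {
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(a + i));
    }
    sum = hsum_float_8_sketch(acc);
#endif
    for (; i < n; ++i) {
        sum += a[i];
    }
    return sum;
}
```

Keeping the reduction in one place means a later change to it (for example, an AVX-only versus AVX2 variant) only has to be made once.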