ggml : add Q8_0 quantization format (rename the old one to Q8_1) (ARM NEON) #1179
For AVX2/AVX/scalar, we might want to keep I'm actually surprised that they're worth using on ARM NEON, as the alternative is simply subtracting 8 from the Q4 quants. |
@sw there is no noticeable difference between the two. Still, changed to use |
I guess it's not finished? You're using |
Wow - this is difficult 😄 I keep messing up something |
Looks good now; I think it's very slightly slower for Q4_0 and Q4_2 because we're now missing the SIMD optimizations for |
Ok, will merge now and we can finish the AVX stuff from |
8-bit integer quantization support
Perplexity: 5.9563