
Tinygrad Quantization Support [WIP] #630

Draft · wants to merge 4 commits into base: main

Conversation

@KhanerX commented Jan 25, 2025

What I did:

  1. Define custom layers for affine-quantized models, with integer weights plus float16 scales and biases (zero-point correction); a sketch of such a layer follows this list.
  2. Load an MLX-Community quantized model and unpack its weights.
  3. Write the forward logic for the quantized layers, following this paper (see section 2.3).
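For reference, a minimal sketch of what such a layer could look like in tinygrad. The class name, attribute names, group size of 64, and dequantize-then-matmul structure are illustrative assumptions, not this PR's actual code:

```python
from tinygrad import Tensor, dtypes

class AffineQuantizedLinear:
  # hypothetical layer: int8 weights plus per-group fp16 scales/biases
  def __init__(self, in_features: int, out_features: int, group_size: int = 64):
    assert in_features % group_size == 0, "group size must divide in_features"
    self.group_size = group_size
    self.weight = Tensor.empty(out_features, in_features, dtype=dtypes.int8)
    self.scales = Tensor.empty(out_features, in_features // group_size, dtype=dtypes.float16)
    self.biases = Tensor.empty(out_features, in_features // group_size, dtype=dtypes.float16)

  def __call__(self, x: Tensor) -> Tensor:
    out_f, in_f = self.weight.shape
    # dequantize each group with the affine map w ~ scale * q + bias,
    # doing the math in float32 (see the fp16 overflow note further down)
    q = self.weight.cast(dtypes.float32).reshape(out_f, in_f // self.group_size, self.group_size)
    w = q * self.scales.float().unsqueeze(-1) + self.biases.float().unsqueeze(-1)
    return x.float() @ w.reshape(out_f, in_f).T
```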

Todo:

  • Write tests; test with multiple nodes and different Llama models.
  • Support 4-bit quantization.
  • Do the forward math in integer arithmetic (see section 2.2 of the mentioned paper; a sketch follows this list).
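For the last item, a rough NumPy sketch of the integer-domain identity from section 2.2, where a real value is represented as r = S * (q - Z); the function name is illustrative and this is not the PR's code:

```python
import numpy as np

def quantized_matmul(q1, Z1, S1, q2, Z2, S2):
  # with r = S * (q - Z), the float product r1 @ r2 equals
  # S1 * S2 * (q1 - Z1) @ (q2 - Z2); accumulate in int32 so the
  # int8 products and sums cannot overflow
  acc = (q1.astype(np.int32) - Z1) @ (q2.astype(np.int32) - Z2)
  # a fully integer pipeline would requantize: q3 = Z3 + round(acc * S1 * S2 / S3)
  return S1 * S2 * acc
```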

With this first commit, you can run `exo --run-model="llama-3.2-1b-8bit"` with the tinygrad backend and the "mlx-community" model.

@KhanerX (Author) commented Jan 25, 2025

Also, I'm doing the math in float32 right now, which adds overhead. When I change it to float16, I think something overflows and the model outputs nothing. I will fix this.
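For context, float16 tops out at about 65504, so accumulating long dot products in fp16 can overflow to inf and silence the output. A minimal sketch of one common workaround, assuming the overflow happens during matmul accumulation (an assumption, not a confirmed diagnosis of this PR):

```python
from tinygrad import Tensor, dtypes

def fp16_safe_matmul(x: Tensor, w: Tensor) -> Tensor:
  # keep storage in fp16 but do the accumulation in fp32, then cast back
  return (x.float() @ w.float()).cast(dtypes.float16)
```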

@AlexCheema (Contributor) commented:

This is a great start - I tested this and it works.
That's awesome because it means we can support any MLX model in tinygrad.

Are you sending parameters to the GPU in float32, or are they being sent in fp8? Just wondering what kind of speed to expect here, and how close this gets to MLX quantized.

@varshith15 (Contributor) commented Feb 5, 2025

Tested it out on an M3 Pro with mlx-community/Llama-3.2-1B-Instruct-8bit.

old PR: [benchmark screenshot]

new PR: [benchmark screenshot]

MLX: [benchmark screenshot]

just 2.5x slower than MLX now 🚀🚀

@KhanerX @AlexCheema
