gemma : use more bits for the token_embd.weight tensor #5650
Conversation
I imagine that for models that share the same tensor for
#5651 as well
I changed it as suggested. Did a couple of PPL runs with Gemma 2B:
For comparison, this is the PPL on
Also, here is the speed on M2 Ultra using different types for the tensor:
build: 488bd97 (2232)
@ggerganov, FYI: llama-cpp-python does not work for Gemma GGUF either
* gemma : use Q8_0 for the token_embd.weight tensor
* llama : quantize token_embd.weight using output type
(cherry picked from commit 96633ee)
Signed-off-by: Jared Van Bortel <jared@nomic.ai>
* gemma : use Q8_0 for the token_embd.weight tensor
* llama : quantize token_embd.weight using output type
Based on some anecdotal runs with Q4 quantizations, it seems that the quality of the generated responses is very sensitive to the type of the token_embd.weight tensor. Quantizing this tensor to Q8_0 seems like a safe bet, though there might be better strategies.
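Below is a minimal C++ sketch of the idea, not the actual llama.cpp quantization code: when choosing a per-tensor quantization type, token_embd.weight is given the same type as the output tensor (e.g. Q8_0) instead of the default low-bit type used for the rest of the weights. The QuantType enum and the pick_type helper are hypothetical names for illustration only.

```cpp
#include <cstdio>
#include <string>

// Hypothetical subset of quantization types, for illustration only.
enum class QuantType { Q4_0, Q6_K, Q8_0 };

// Pick the type for a tensor given the default type of the quantization mix
// and the type chosen for the output tensor.
static QuantType pick_type(const std::string & name, QuantType default_type, QuantType output_type) {
    // The output tensor is usually kept at higher precision already.
    if (name == "output.weight") {
        return output_type;
    }
    // The change discussed here: token_embd.weight follows the output type
    // (e.g. Q8_0), because generation quality is very sensitive to it.
    if (name == "token_embd.weight") {
        return output_type;
    }
    return default_type;
}

int main() {
    const char * names[] = { "token_embd.weight", "output.weight", "blk.0.attn_q.weight" };
    for (const char * name : names) {
        QuantType t = pick_type(name, QuantType::Q4_0, QuantType::Q8_0);
        std::printf("%-24s -> %s\n", name,
                    t == QuantType::Q8_0 ? "Q8_0" : t == QuantType::Q6_K ? "Q6_K" : "Q4_0");
    }
    return 0;
}
```

In this sketch the default mix is Q4_0 and the output type is Q8_0, so token_embd.weight and output.weight end up at Q8_0 while the remaining tensors stay at the default type.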