The new QX_K_M quants are producing gibberish #1091
Comments
If you revert to 0.2.28 it seems to work, so it doesn't seem related to quantization.
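For anyone who wants to try that, a minimal sketch of pinning the earlier wheel in the same Colab setup used in the repro below (the version number comes from this comment; --force-reinstall is only needed if a newer wheel is already installed):
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install --quiet --force-reinstall llama-cpp-python==0.2.28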
Reproduction in Colab:
!pip install --upgrade --quiet sentence_transformers langchain faiss-cpu boto3 awscli
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install --quiet llama-cpp-python
!wget https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q5_K_M.gguf
from llama_cpp import Llama
from langchain.llms import LlamaCpp  # needed for the LangChain runs further down
print("".join(["-"*10, "Llama.cpp", "-"*10]))
llm = Llama(model_path="phi-2.Q5_K_M.gguf")
output = llm(
    "What is the average flying veolicty",  # Prompt
    max_tokens=32,  # Generate up to 32 tokens
    stop=["Q:", "\n"],  # Stop generating just before the model would generate a new question
    echo=True  # Echo the prompt back in the output
)  # Generate a completion; can also call create_completion
print(output['choices'])
print("".join(["-"*10, "Llama.cpp GPU", "-"*10]))
llm = Llama(model_path="phi-2.Q5_K_M.gguf", n_gpu_layers=35)
output = llm(
    "What is the average flying veolicty",  # Prompt
    max_tokens=32,  # Generate up to 32 tokens
    stop=["Q:", "\n"],  # Stop generating just before the model would generate a new question
    echo=True  # Echo the prompt back in the output
)  # Generate a completion; can also call create_completion
print(output['choices'])
print("".join(["-"*10, "LangChain CPU", "-"*10]))
llm_cpu = LlamaCpp(
    model_path="phi-2.Q5_K_M.gguf",
    temperature=0.10,
    max_tokens=64,
    verbose=True,
    # n_gpu_layers=35,
    echo=True
)
print(llm_cpu("Say hello"))
print("".join(["-"*10, "LangChain GPU", "-"*10]))
llm_gpu = LlamaCpp(
    model_path="phi-2.Q5_K_M.gguf",
    temperature=0.10,
    max_tokens=64,
    verbose=True,
    n_gpu_layers=35,
    echo=True
)
print(llm_gpu("Say hello"))
Output:
----------Llama.cpp----------
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
[{'text': 'What is the average flying veolicty in meters per second?', 'index': 0, 'logprobs': None, 'finish_reason': 'stop'}]
----------Llama.cpp GPU----------
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
[{'text': 'What is the average flying veolictyGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG', 'index': 0, 'logprobs': None, 'finish_reason': 'length'}]
----------LangChain CPU----------
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
to the new and improved version of our website.
We’ve made some changes to make it easier for you to find what you need.
You can also contact us by email or phone.
If you have any questions, please don't hesitate to get in touch.
The UK government has announced a
----------LangChain GPU----------
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
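Judging from the outputs above, the corruption only shows up when layers are offloaded to the GPU. A minimal sketch that isolates n_gpu_layers as the only variable, reusing the same phi-2 Q5_K_M file and prompt from the repro (temperature 0 makes decoding greedy, so the CPU and GPU runs should agree if offloading is working correctly):
from llama_cpp import Llama

prompt = "What is the average flying veolicty"  # same prompt as above
for n_gpu_layers in (0, 35):  # 0 = pure CPU, 35 = offload everything, as in the repro
    llm = Llama(model_path="phi-2.Q5_K_M.gguf", n_gpu_layers=n_gpu_layers, verbose=False)
    out = llm(prompt, max_tokens=32, temperature=0.0, stop=["Q:", "\n"])
    print(n_gpu_layers, repr(out["choices"][0]["text"]))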
Cross-posting from #1089, but passing
It's working now with the v2 version; I guess their quants had a problem in the first place.
@IsNoobgrammer I'm very curious how you set up LlamaIndex and llama-cpp-python. I tried a similar approach with quantized Mixtral and got gibberish answers. I posted a question in the LlamaIndex repo; it would be great if you could provide some hints.
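Not part of the original thread, but here is a rough sketch of how the LlamaIndex llama.cpp wrapper is usually wired up (import path and parameter names follow the llama-index 0.9.x LlamaCPP docs; the Mixtral GGUF filename and layer count are placeholders, not the setup @IsNoobgrammer actually used):
from llama_index.llms import LlamaCPP  # llama-index 0.9.x import path

llm = LlamaCPP(
    model_path="mixtral-8x7b.Q5_K_M.gguf",  # placeholder filename
    temperature=0.1,
    max_new_tokens=64,
    context_window=4096,
    model_kwargs={"n_gpu_layers": 35},  # forwarded to llama_cpp.Llama
    verbose=True,
)
print(llm.complete("Say hello"))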
Thanks. As of 2024-01-18:
@BadisG going to close this for now, feel free to reopen if the issue wasn't resolved.
Hello,
I was trying this model that has the new types of GGUF quants:
https://huggingface.co/NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO-GGUF/tree/main
And it's not working for Q5_K_M, it produces gibberish:
[screenshot of the gibberish output]
This bug was supposed to have been fixed 2 days ago here:
ggml-org/llama.cpp#4927
And I was using your latest version, which was bumped 11 hours ago (v0.2.29), so I thought it would work, but it doesn't.
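One quick sanity check (a sketch on my side, not something from the report itself) is to confirm the environment really picked up the 0.2.29 wheel rather than a cached older build, since the upstream llama.cpp fix only helps if the bundled library was rebuilt:
import llama_cpp
print(llama_cpp.__version__)  # expect 0.2.29 here if the new wheel is active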