
The new QX_K_M quants are producing gibberish #1091

Closed

BadisG opened this issue Jan 16, 2024 · 8 comments

Comments

@BadisG

BadisG commented Jan 16, 2024

Hello,

I was trying this model that has the new types of GGUF quants:
https://huggingface.co/NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO-GGUF/tree/main

And it's not working for Q5_K_M; it produces gibberish:
[screenshot of gibberish output]

This bug was supposed to have been fixed two days ago here:
ggml-org/llama.cpp#4927

I was using your latest version (v0.2.29, bumped 11 hours ago), so I expected it to work, but it doesn't.

@BadisG BadisG changed the title The new QX_K_M quants are producing glibberish The new QX_K_M quants are producing gibberish Jan 16, 2024
@IsNoobgrammer

When inferencing on CPU only, it gives correct results:
[screenshot of correct output]
With GPU offloading, it gives gibberish results:
[screenshot of gibberish output]
I was using a Q4_K_M GGUF file, with LlamaIndex as the wrapper.
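For anyone trying to reproduce this, here is a minimal sketch of the setup described above, assuming the pre-0.10 LlamaIndex layout (llama_index.llms.LlamaCPP) and a placeholder model path:

from llama_index.llms import LlamaCPP

def build_llm(n_gpu_layers: int) -> LlamaCPP:
    # n_gpu_layers=0 keeps everything on the CPU; a positive value offloads layers to the GPU
    return LlamaCPP(
        model_path="mixtral-8x7b.Q4_K_M.gguf",  # placeholder path to the Q4_K_M file
        temperature=0.1,
        max_new_tokens=64,
        context_window=4096,
        model_kwargs={"n_gpu_layers": n_gpu_layers},
        verbose=True,
    )

print(build_llm(0).complete("Say hello"))   # CPU only: coherent output
print(build_llm(35).complete("Say hello"))  # GPU offload: gibberish on 0.2.29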

@LorenzoBoccaccia

If you revert to 0.2.28 it seems to work, so it doesn't seem related to quantization.
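If you want to try that, pinning the previous release looks like this (mirroring the CUDA install line from the reproduction below):

!CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install --force-reinstall --no-cache-dir llama-cpp-python==0.2.28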

@xmaayy

xmaayy commented Jan 16, 2024

Reproduction in Colab:

!pip install --upgrade --quiet sentence_transformers langchain faiss-cpu boto3  awscli
!CMAKE_ARGS="-DLLAMA_CUBLAS=on"  pip install --quiet  llama-cpp-python
!wget https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q5_K_M.gguf
from llama_cpp import Llama
from langchain_community.llms import LlamaCpp  # LlamaCpp wrapper used in the LangChain examples below
print("".join(["-"*10, "Llama.cpp", "-"*10]))
llm = Llama(model_path="phi-2.Q5_K_M.gguf")
output = llm(
      "What is the average flying veolicty", # Prompt
      max_tokens=32, # Generate up to 32 tokens
      stop=["Q:", "\n"], # Stop generating just before the model would generate a new question
      echo=True # Echo the prompt back in the output
) # Generate a completion, can also call create_completion
print(output['choices'])
print("".join(["-"*10, "Llama.cpp GPU", "-"*10]))
llm = Llama(model_path="phi-2.Q5_K_M.gguf", n_gpu_layers=35)
output = llm(
      "What is the average flying veolicty", # Prompt
      max_tokens=32, # Generate up to 32 tokens
      stop=["Q:", "\n"], # Stop generating just before the model would generate a new question
      echo=True # Echo the prompt back in the output
) # Generate a completion, can also call create_completion
print(output['choices'])
print("".join(["-"*10, "LangChain CPU", "-"*10]))
llm_cpu = LlamaCpp(
  model_path="phi-2.Q5_K_M.gguf",
  temperature=0.10,
  max_tokens=64,
  verbose=True,
  #n_gpu_layers=35,
  echo=True
)
print(llm_cpu("Say hello"))
print("".join(["-"*10, "LangChain GPU", "-"*10]))
llm_gpu = LlamaCpp(
  model_path="phi-2.Q5_K_M.gguf",
  temperature=0.10,
  max_tokens=64,
  verbose=True,
  n_gpu_layers=35,
  echo=True
)
print(llm_gpu("Say hello"))

Output:

----------Llama.cpp----------
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 
[{'text': 'What is the average flying veolicty in meters per second?', 'index': 0, 'logprobs': None, 'finish_reason': 'stop'}]
----------Llama.cpp GPU----------
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 
[{'text': 'What is the average flying veolictyGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG', 'index': 0, 'logprobs': None, 'finish_reason': 'length'}]
----------LangChain CPU----------
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 
 to the new and improved version of our website.
We’ve made some changes to make it easier for you to find what you need.
You can also contact us by email or phone.
If you have any questions, please don't hesitate to get in touch.
The UK government has announced a
----------LangChain GPU----------
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG

@iamlemec
Contributor

Cross posting from #1089, but passing offload_kqv=True to Llama works for me.
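A minimal sketch of that workaround against the reproduction above, assuming the phi-2.Q5_K_M.gguf file and the offload_kqv keyword accepted by the Llama constructor:

from llama_cpp import Llama

# Keep GPU offload enabled but explicitly offload the KV cache as well,
# which is the workaround reported in the comment above.
llm = Llama(
    model_path="phi-2.Q5_K_M.gguf",
    n_gpu_layers=35,
    offload_kqv=True,
)
print(llm("Say hello", max_tokens=32, echo=True)["choices"])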

@BadisG
Author

BadisG commented Jan 17, 2024

It's working now with the v2 version; I guess their quants had a problem in the first place:
https://huggingface.co/NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO-GGUF-v2/blob/main/Nous-Hermes-2-Mixtral-8x7B-DPO.Q5_K_M.gguf
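For completeness, one way to fetch that v2 file programmatically (assuming huggingface_hub is installed; repo and filename taken from the link above):

from huggingface_hub import hf_hub_download

# Download the re-quantized (v2) Q5_K_M file referenced above.
model_path = hf_hub_download(
    repo_id="NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO-GGUF-v2",
    filename="Nous-Hermes-2-Mixtral-8x7B-DPO.Q5_K_M.gguf",
)
print(model_path)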

@timtensor

@IsNoobgrammer I'm very curious how you set up LlamaIndex and llama-cpp-python. I tried a similar approach for quantized Mixtral and got gibberish answers. I posted a question in the LlamaIndex repo; it would be great if you could provide some hints:
run-llama/llama_index#10072

@IgorBeHolder

if you revert to 0.2.28 it seems to work so doesn't seem related to quantization

Thanks, it works for me.

As of 2024-01-18:
N_GPU_LAYERS=-1 -> gibberish results
N_GPU_LAYERS=0 -> OK
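In code, that observation corresponds to something like the following sketch (N_GPU_LAYERS read from the environment; the model path is a placeholder):

import os
from llama_cpp import Llama

# As of 2024-01-18: -1 (offload all layers) -> gibberish, 0 (CPU only) -> OK.
n_gpu_layers = int(os.environ.get("N_GPU_LAYERS", "0"))
llm = Llama(model_path="model.Q5_K_M.gguf", n_gpu_layers=n_gpu_layers)  # placeholder path
print(llm("Say hello", max_tokens=32)["choices"][0]["text"])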

@abetlen
Owner

abetlen commented Jan 25, 2024

@BadisG going to close for now, feel free to reopen if the issue wasn't resolved.

@abetlen abetlen closed this as completed Jan 25, 2024