
The new QX_K_M quants are producing gibberish #1091

Closed

BadisG opened this issue Jan 16, 2024 · 8 comments

Comments

@BadisG

BadisG commented Jan 16, 2024

Hello,

I was trying this model that has the new types of GGUF quants:
https://huggingface.co/NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO-GGUF/tree/main

And it's not working for Q5_K_M; it produces gibberish:
[screenshot of gibberish output]

This bug was supposed to have been fixed two days ago here:
ggml-org/llama.cpp#4927

I was using your latest version (v0.2.29, bumped 11 hours ago), so I expected it to work, but it doesn't.

@BadisG BadisG changed the title The new QX_K_M quants are producing glibberish The new QX_K_M quants are producing gibberish Jan 16, 2024
@IsNoobgrammer

When inferencing on CPU only, it gives correct results:
[screenshot of correct output]
With GPU offloading, it gives gibberish results:
[screenshot of gibberish output]
I was using a Q4_K_M GGUF file, with LlamaIndex as the wrapper.
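For anyone trying to reproduce this, here is a minimal sketch of the setup described above, assuming the pre-0.10 LlamaIndex layout (llama_index.llms.LlamaCPP) and a placeholder model path:

from llama_index.llms import LlamaCPP

def build_llm(n_gpu_layers: int) -> LlamaCPP:
    # n_gpu_layers=0 keeps everything on the CPU; a positive value offloads layers to the GPU
    return LlamaCPP(
        model_path="mixtral-8x7b.Q4_K_M.gguf",  # placeholder path to the Q4_K_M file
        temperature=0.1,
        max_new_tokens=64,
        context_window=4096,
        model_kwargs={"n_gpu_layers": n_gpu_layers},
        verbose=True,
    )

print(build_llm(0).complete("Say hello"))   # CPU only: coherent output
print(build_llm(35).complete("Say hello"))  # GPU offload: gibberish on 0.2.29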

@LorenzoBoccaccia

If you revert to 0.2.28 it seems to work, so it doesn't seem related to quantization.
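If you want to try that, pinning the previous release looks like this (mirroring the CUDA install line from the reproduction below):

!CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install --force-reinstall --no-cache-dir llama-cpp-python==0.2.28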

@xmaayy

xmaayy commented Jan 16, 2024

Reproduction in Colab:

!pip install --upgrade --quiet sentence_transformers langchain faiss-cpu boto3  awscli
!CMAKE_ARGS="-DLLAMA_CUBLAS=on"  pip install --quiet  llama-cpp-python
!wget https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q5_K_M.gguf
from llama_cpp import Llama
from langchain_community.llms import LlamaCpp  # LlamaCpp wrapper used in the LangChain examples below
print("".join(["-"*10, "Llama.cpp", "-"*10]))
llm = Llama(model_path="phi-2.Q5_K_M.gguf")
output = llm(
      "What is the average flying veolicty", # Prompt
      max_tokens=32, # Generate up to 32 tokens
      stop=["Q:", "\n"], # Stop generating just before the model would generate a new question
      echo=True # Echo the prompt back in the output
) # Generate a completion, can also call create_completion
print(output['choices'])
print("".join(["-"*10, "Llama.cpp GPU", "-"*10]))
llm = Llama(model_path="phi-2.Q5_K_M.gguf", n_gpu_layers=35)
output = llm(
      "What is the average flying veolicty", # Prompt
      max_tokens=32, # Generate up to 32 tokens
      stop=["Q:", "\n"], # Stop generating just before the model would generate a new question
      echo=True # Echo the prompt back in the output
) # Generate a completion, can also call create_completion
print(output['choices'])
print("".join(["-"*10, "LangChain CPU", "-"*10]))
llm_cpu = LlamaCpp(
  model_path="phi-2.Q5_K_M.gguf",
  temperature=0.10,
  max_tokens=64,
  verbose=True,
  #n_gpu_layers=35,
  echo=True
)
print(llm_cpu("Say hello"))
print("".join(["-"*10, "LangChain GPU", "-"*10]))
llm_gpu = LlamaCpp(
  model_path="phi-2.Q5_K_M.gguf",
  temperature=0.10,
  max_tokens=64,
  verbose=True,
  n_gpu_layers=35,
  echo=True
)
print(llm_gpu("Say hello"))

Output:

----------Llama.cpp----------
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 
[{'text': 'What is the average flying veolicty in meters per second?', 'index': 0, 'logprobs': None, 'finish_reason': 'stop'}]
----------Llama.cpp GPU----------
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 
[{'text': 'What is the average flying veolictyGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG', 'index': 0, 'logprobs': None, 'finish_reason': 'length'}]
----------LangChain CPU----------
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 
 to the new and improved version of our website.
We’ve made some changes to make it easier for you to find what you need.
You can also contact us by email or phone.
If you have any questions, please don't hesitate to get in touch.
The UK government has announced a
----------LangChain GPU----------
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG

@iamlemec
Contributor

Cross posting from #1089, but passing offload_kqv=True to Llama works for me.
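A minimal sketch of that workaround against the reproduction above, assuming the phi-2.Q5_K_M.gguf file and the offload_kqv keyword accepted by the Llama constructor:

from llama_cpp import Llama

# Keep GPU offload enabled but explicitly offload the KV cache as well,
# which is the workaround reported in the comment above.
llm = Llama(
    model_path="phi-2.Q5_K_M.gguf",
    n_gpu_layers=35,
    offload_kqv=True,
)
print(llm("Say hello", max_tokens=32, echo=True)["choices"])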

@BadisG
Author

BadisG commented Jan 17, 2024

It's working now with the v2 version; I guess their quants had a problem in the first place:
https://huggingface.co/NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO-GGUF-v2/blob/main/Nous-Hermes-2-Mixtral-8x7B-DPO.Q5_K_M.gguf
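For completeness, one way to fetch that v2 file programmatically (assuming huggingface_hub is installed; repo and filename taken from the link above):

from huggingface_hub import hf_hub_download

# Download the re-quantized (v2) Q5_K_M file referenced above.
model_path = hf_hub_download(
    repo_id="NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO-GGUF-v2",
    filename="Nous-Hermes-2-Mixtral-8x7B-DPO.Q5_K_M.gguf",
)
print(model_path)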

@timtensor

@IsNoobgrammer I'm very curious how you set up LlamaIndex and llama-cpp-python. I tried a similar approach for quantized Mixtral and got gibberish answers. I posted a question in the LlamaIndex repo; it would be great if you could provide some hints:
run-llama/llama_index#10072

@IgorBeHolder

if you revert to 0.2.28 it seems to work so doesn't seem related to quantization

Thanks, it works for me.

As of 2024-01-18:
N_GPU_LAYERS=-1 -> gibberish results
N_GPU_LAYERS=0 -> OK
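In code, that observation corresponds to something like the following sketch (N_GPU_LAYERS read from the environment; the model path is a placeholder):

import os
from llama_cpp import Llama

# As of 2024-01-18: -1 (offload all layers) -> gibberish, 0 (CPU only) -> OK.
n_gpu_layers = int(os.environ.get("N_GPU_LAYERS", "0"))
llm = Llama(model_path="model.Q5_K_M.gguf", n_gpu_layers=n_gpu_layers)  # placeholder path
print(llm("Say hello", max_tokens=32)["choices"][0]["text"])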

@abetlen
Owner

abetlen commented Jan 25, 2024

@BadisG going to close for now, feel free to reopen if the issue wasn't resolved.

@abetlen abetlen closed this as completed Jan 25, 2024