
Commit 7c898d5 breaks generation on GPU #1089

Closed

iamlemec opened this issue Jan 15, 2024 · 17 comments
Labels
bug Something isn't working

Comments

@iamlemec
Contributor

iamlemec commented Jan 15, 2024

From commit 7c898d5 onwards, the output of any type of generation/completion on the GPU is just "#" repeated forever. For instance, using the example from README.md:

from llama_cpp import Llama
llm = Llama(model_path='models/llama2-7b.q4_0.gguf', n_gpu_layers=100)
for s in llm('Building a website can be done in 10 simple steps:\nStep 1:', stream=True):
    print(s)

The output is the following repeated:

{'id': 'cmpl-14ed3b80-49af-453d-99a4-c7925f5680f7', 'object': 'text_completion', 'created': 1705351368, 'model': 'models/llama2-7b.q4_0.gguf', 'choices': [{'text': '#', 'index': 0, 'logprobs': None, 'finish_reason': None}]}

Generation works fine on the CPU and with previous commits, and the problem doesn't seem to be related to quantization or model type. Interestingly, generation also works using pure llama.cpp through the main interface for both CPU and GPU; I tested this for the current master and the commits around the change above (notably 76484fb and 1d11838). I also managed to get it working in llama-cpp-python using the low-level API, just using simple batching and llama_decode.
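For reference, a minimal CPU-only sanity check in the spirit of the example above (same model path and prompt; n_gpu_layers=0 is just an assumption used here to keep every layer on the CPU):

from llama_cpp import Llama

# Same model and prompt as above; n_gpu_layers=0 keeps all layers on the CPU,
# which is the path that still generates correctly on the affected versions.
llm_cpu = Llama(model_path='models/llama2-7b.q4_0.gguf', n_gpu_layers=0)
out = llm_cpu('Building a website can be done in 10 simple steps:\nStep 1:', max_tokens=32)
print(out['choices'][0]['text'])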

Environment info:

GPU: RTX A6000
OS: Linux 6.6.0-0.rc5
CUDA SDK: 12.2
CUDA Drivers: 535.113.01

Thanks!

@abetlen added the bug label on Jan 15, 2024
@abetlen
Owner

abetlen commented Jan 15, 2024

@iamlemec will investigate

@markyfsun

Same here.

llama_cpp_python: 0.2.29
GPU: NVIDIA 3090
OS: Ubuntu 22.04
CUDA SDK: 12.2
CUDA Drivers: 535.146.02

I am using the FastAPI server. At first the server generated meaningful responses to a few short inputs. When I asked it to respond to a long input, it repeated # forever. When I then retried the earlier short inputs, I got only #.

Downgrading llama_cpp_python to 0.2.28 solves the issue.

@abetlen
Owner

abetlen commented Jan 16, 2024

Hmmm, have not been able to reproduce this on my Nvidia 2060 card with q4_k_m quants.

@xmaayy

xmaayy commented Jan 16, 2024

Hmmm, have not been able to reproduce this on my Nvidia 2060 card with q4_k_m quants.

I have a reproduction example on the other issue #1091 (comment)

@iamlemec
Contributor Author

I think I found the answer! You need to set offload_kqv=True for things to work. The default in the Llama class is False, but the underlying default from llama_context_default_params is True, which explains why it was working with the low level API.
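For anyone hitting this on an affected version, a minimal sketch of the workaround (the model path and layer count are just the ones from the original report):

from llama_cpp import Llama

# Pass offload_kqv explicitly: the Llama wrapper defaulted it to False here,
# while llama_context_default_params defaults to True.
llm = Llama(
    model_path='models/llama2-7b.q4_0.gguf',
    n_gpu_layers=100,
    offload_kqv=True,
)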

@abetlen
Owner

abetlen commented Jan 16, 2024

@iamlemec yup, that broke it for me too, great catch. Can you try it out with llama.cpp using --no_kv_offload? If it's present there, we can open an upstream issue and get it fixed!

@iamlemec
Contributor Author

Yup, getting all #s from the same prompt as above using --no_kv_offload! Do you want to take the lead on the upstream bug or shall I?

@slaren

slaren commented Jan 18, 2024

This is indeed a bug in llama.cpp, but I would strongly recommend enabling offload_kqv by default, as it is in llama.cpp. Even in cases with low VRAM, it is usually better to offload fewer layers and keep offload_kqv enabled.
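As an illustration of that advice, a rough low-VRAM configuration might look like the following; the layer count is an arbitrary example and has to be tuned per model and card:

from llama_cpp import Llama

# Keep the KV cache offloaded (llama.cpp's default) and reduce n_gpu_layers
# instead when VRAM runs short; 20 is purely illustrative.
llm = Llama(
    model_path='models/llama2-7b.q4_0.gguf',
    n_gpu_layers=20,
    offload_kqv=True,
)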

@m-from-space

I can confirm that version 0.2.29 breaks generation completely for different models.

@iactix

iactix commented Jan 18, 2024

This is indeed a bug in llama.cpp, but I would strongly recommend enabling offload_kqv by default, as it is in llama.cpp. Even in cases with low VRAM, it is usually better to offload fewer layers and keep offload_kqv enabled.

I seem to end up with 1.1 tokens per second instead of 1.5 after enabling the flag and adjusting the layers to refit everything into VRAM. It is a really bad setup (2600X & RTX1080) running nous-hermes-2-yi-34b.Q4_K_M.gguf, now with 14 instead of 18 layers offloaded to the GPU. Just providing this as a data point.

@abetlen
Owner

abetlen commented Jan 18, 2024

This is indeed a bug in llama.cpp, but I would strongly recommend enabling offload_kqv by default, as it is in llama.cpp. Even in cases with low VRAM, it is usually better to offload fewer layers and keep offload_kqv enabled.

Thanks @slaren, will do that!

@abetlen
Owner

abetlen commented Jan 19, 2024

offload_kqv is now set to True by default starting from version 0.2.30
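If you want to confirm which default your installed version uses, here is a quick sketch that just inspects the installed package:

import inspect
import llama_cpp
from llama_cpp import Llama

print(llama_cpp.__version__)  # 0.2.30 or later should ship the new default
print(inspect.signature(Llama.__init__).parameters['offload_kqv'].default)  # expect True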

@m-from-space

I can confirm that version 0.2.29 breaks generation completely for different models.

Issue is fixed for me with version 0.2.31. Thank you!

@iactix

iactix commented Jan 21, 2024

I updated to 0.2.31 and with offload_kqv=False I still get gibberish. It looks like "▅体型做为畅销refresh LPона за junior后台飞行员сен sizeof", so a bit more creative than the ### we had before. The characters showed up as black boxes in my GUI, so I guess it still starts out somewhat like that. I'm using nous-hermes-2-yi-34b.Q4_K_M.gguf as described above. It's the first reply in a fresh run, no state loading or similar.

@iamlemec
Contributor Author

@iactix Yup, this issue was partially driven by a bug in llama.cpp itself (ggml-org/llama.cpp#4991). That bug has been fixed in upstream llama.cpp but hasn't been propagated to llama-cpp-python yet. I assume this will happen in the next release. Until then, only offload_kqv=True (the new default) will work.

@abetlen
Owner

abetlen commented Jan 22, 2024

@iamlemec @iactix should be in 0.2.32 let me know if that works! @iamlemec thanks again for all the help identifying this issue!

@iactix

iactix commented Jan 22, 2024

@iamlemec @iactix should be in 0.2.32 let me know if that works! @iamlemec thanks again for all the help identifying this issue!

It now works with that flag set to False. However, I fail to see the difference in behavior. I could swear I calibrated it to come out below 8 GB of VRAM usage in my old version (I updated for the state loading fix), but it now keeps coming out at 9.2 GB whether the flag is on or off. The speed seems the same too. And since reducing layers doesn't really make it faster, I still end up with about 1.2 tps instead of the 1.5 I had in an earlier version. Honestly, I don't know how much the KV offloading should even affect VRAM usage at 4K context (with fresh generation after a sizeable system prompt), so it may actually be other things that affected the speed.
