
Commit 7c898d5 breaks generation on GPU #1089

Closed

iamlemec opened this issue Jan 15, 2024 · 17 comments
Labels
bug Something isn't working

Comments

@iamlemec
Contributor

iamlemec commented Jan 15, 2024

From commit 7c898d5 onwards, the output of any type of generation/completion on the GPU is just "#" repeated forever. For instance, using the example from README.md:

from llama_cpp import Llama
llm = Llama(model_path='models/llama2-7b.q4_0.gguf', n_gpu_layers=100)
for s in llm('Building a website can be done in 10 simple steps:\nStep 1:', stream=True):
    print(s)

The output is the following repeated:

{'id': 'cmpl-14ed3b80-49af-453d-99a4-c7925f5680f7', 'object': 'text_completion', 'created': 1705351368, 'model': 'models/llama2-7b.q4_0.gguf', 'choices': [{'text': '#', 'index': 0, 'logprobs': None, 'finish_reason': None}]}

Generation works fine on the CPU and with previous commits, and the problem doesn't seem to be related to quantization or model type. Interestingly, generation also works using pure llama.cpp through the main interface for both CPU and GPU; I tested this for the current master and the commits around the change above (notably 76484fb and 1d11838). I also managed to get it working in llama-cpp-python using the low-level API, just using simple batching and llama_decode.
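For reference, a minimal CPU-only sanity check in the spirit of the example above (same model path and prompt; n_gpu_layers=0 is just an assumption used here to keep every layer on the CPU):

from llama_cpp import Llama

# Same model and prompt as above; n_gpu_layers=0 keeps all layers on the CPU,
# which is the path that still generates correctly on the affected versions.
llm_cpu = Llama(model_path='models/llama2-7b.q4_0.gguf', n_gpu_layers=0)
out = llm_cpu('Building a website can be done in 10 simple steps:\nStep 1:', max_tokens=32)
print(out['choices'][0]['text'])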

Environment info:

GPU: RTX A6000
OS: Linux 6.6.0-0.rc5
CUDA SDK: 12.2
CUDA Drivers: 535.113.01

Thanks!

@abetlen added the bug label on Jan 15, 2024
@abetlen
Owner

abetlen commented Jan 15, 2024

@iamlemec will investigate

@markyfsun

Same here.

llama_cpp_python: 0.2.29
GPU: NVIDIA 3090
OS: Ubuntu 22.04
CUDA SDK: 12.2
CUDA Drivers: 535.146.02

I am using the FastAPI server. At first the server generated meaningful responses to a few short inputs. When I asked it to respond to a long input, it repeated # forever. When I then retried the earlier short inputs, I got only #.

Downgrading llama_cpp_python to 0.2.28 solves the issue.

@abetlen
Owner

abetlen commented Jan 16, 2024

Hmmm, have not been able to reproduce this on my Nvidia 2060 card with q4_k_m quants.

@xmaayy

xmaayy commented Jan 16, 2024

Hmmm, have not been able to reproduce this on my Nvidia 2060 card with q4_k_m quants.

I have a reproduction example on the other issue #1091 (comment)

@iamlemec
Contributor Author

I think I found the answer! You need to set offload_kqv=True for things to work. The default in the Llama class is False, but the underlying default from llama_context_default_params is True, which explains why it was working with the low level API.
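For anyone hitting this on an affected version, a minimal sketch of the workaround (the model path and layer count are just the ones from the original report):

from llama_cpp import Llama

# Pass offload_kqv explicitly: the Llama wrapper defaulted it to False here,
# while llama_context_default_params defaults to True.
llm = Llama(
    model_path='models/llama2-7b.q4_0.gguf',
    n_gpu_layers=100,
    offload_kqv=True,
)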

@abetlen
Owner

abetlen commented Jan 16, 2024

@iamlemec yup, that broke it for me too, great catch. Can you try it out with llama.cpp using --no_kv_offload? If it's present there, we can open an upstream issue and get it fixed!

@iamlemec
Contributor Author

Yup, getting all #s from the same prompt as above using --no_kv_offload! Do you want to take the lead on the upstream bug or shall I?

@slaren

slaren commented Jan 18, 2024

This is indeed a bug in llama.cpp, but I would strongly recommend enabling offload_kqv by default, as it is in llama.cpp. Even in cases with low VRAM, it is usually better to offload fewer layers and keep offload_kqv enabled.
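As an illustration of that advice, a rough low-VRAM configuration might look like the following; the layer count is an arbitrary example and has to be tuned per model and card:

from llama_cpp import Llama

# Keep the KV cache offloaded (llama.cpp's default) and reduce n_gpu_layers
# instead when VRAM runs short; 20 is purely illustrative.
llm = Llama(
    model_path='models/llama2-7b.q4_0.gguf',
    n_gpu_layers=20,
    offload_kqv=True,
)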

@m-from-space

I can confirm that version 0.2.29 breaks generation completely for different models.

@iactix

iactix commented Jan 18, 2024

This is indeed a bug in llama.cpp, but I would strongly recommend enabling offload_kqv by default, as it is in llama.cpp. Even in cases with low VRAM, it is usually better to offload fewer layers and keep offload_kqv enabled.

I seem to end up with 1.1 tokens per second instead of 1.5 after enabling the flag and adjusting the layers to refit everything into VRAM. It is a really bad setup (2600X & RTX1080) running nous-hermes-2-yi-34b.Q4_K_M.gguf, now with 14 instead of 18 layers offloaded to the GPU. Just providing this as a data point.

@abetlen
Owner

abetlen commented Jan 18, 2024

This is indeed a bug in llama.cpp, but I would strongly recommend enabling offload_kqv by default, as it is in llama.cpp. Even in cases with low VRAM, it is usually better to offload fewer layers and keep offload_kqv enabled.

Thanks @slaren, will do that!

@abetlen
Owner

abetlen commented Jan 19, 2024

offload_kqv is now set to True by default starting from version 0.2.30
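If you want to confirm which default your installed version uses, here is a quick sketch that just inspects the installed package:

import inspect
import llama_cpp
from llama_cpp import Llama

print(llama_cpp.__version__)  # 0.2.30 or later should ship the new default
print(inspect.signature(Llama.__init__).parameters['offload_kqv'].default)  # expect True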

@m-from-space

I can confirm that version 0.2.29 breaks generation completely for different models.

Issue is fixed for me with version 0.2.31. Thank you!

@iactix

iactix commented Jan 21, 2024

I updated to 0.2.31 and with offload_kqv=False I still get gibberish. It looks like "▅体型做为畅销refresh LPона за junior后台飞行员сен sizeof", so a bit more creative than the ### we had before. The characters showed up as black boxes in my GUI, so I guess it still starts out somewhat like that. I'm using nous-hermes-2-yi-34b.Q4_K_M.gguf as described above. It's the first reply in a fresh run, no state loading or similar.

@iamlemec
Contributor Author

@iactix Yup, this issue was partially driven by a bug in llama.cpp itself (ggml-org/llama.cpp#4991). That bug has been fixed in upstream llama.cpp but hasn't been propagated to llama-cpp-python yet. I assume this will happen in the next release. Until then, only offload_kqv=True (the new default) will work.

@abetlen
Owner

abetlen commented Jan 22, 2024

@iamlemec @iactix should be in 0.2.32 let me know if that works! @iamlemec thanks again for all the help identifying this issue!

@iactix

iactix commented Jan 22, 2024

@iamlemec @iactix should be in 0.2.32 let me know if that works! @iamlemec thanks again for all the help identifying this issue!

It now works with that flag set to False. However, I fail to see the difference in behavior. I could swear I calibrated it to come out below 8 GB of VRAM usage in my old version (I updated for the state loading fix), but it now keeps coming out at 9.2 GB whether the flag is on or off. The speed seems the same too. And since reducing layers doesn't really make it faster, I still end up with about 1.2 tps instead of the 1.5 I had in an earlier version. Honestly, I don't know how much the KV offloading should even affect VRAM usage at 4K context (with fresh generation after a sizeable system prompt), so it may actually be other things that affected the speed.
