Commit 7c898d5 breaks generation on GPU #1089
Comments
@iamlemec will investigate
Same here.
I am using the fastapi server. I observed that the server could generate meaningful responses for the first few short inputs. When I asked it to respond to a long input, it repeated Downgrade
Hmmm, have not been able to reproduce this on my Nvidia 2060 card with
I have a reproduction example on the other issue #1091 (comment)
I think I found the answer! You need to set
@iamlemec yup that broke it for me too, great catch. Can you try it out with llama.cpp with
Yup, getting all #s from the same prompt as above using
This is indeed a bug in llama.cpp, but I would strongly recommend enabling
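The inline code spans in several comments above were lost during extraction; later messages in this thread name the setting in question as offload_kqv. As a hedged sketch, assuming that is indeed the flag being recommended here (model path and layer count below are placeholders, not values from this thread), it is passed to the high-level Llama constructor:

```python
from llama_cpp import Llama

# Sketch only: offload_kqv controls whether the KV cache and the attention
# ops over it stay on the GPU when layers are partially offloaded.
llm = Llama(
    model_path="./models/model.gguf",  # placeholder path
    n_gpu_layers=18,                   # partial offload, as in the reports above
    offload_kqv=True,                  # the workaround recommended above
)
```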
I can confirm that version 0.2.29 breaks generation completely for different models.
I seem to end up with 1.1 tokens per second instead of 1.5 after enabling the flag and adjusting the layers to fit everything back into VRAM. It is a really bad setup (2600X & RTX1080) running nous-hermes-2-yi-34b.Q4_K_M.gguf, now with 14 instead of 18 layers offloaded to the GPU. Just providing this as a data point.
Thanks @slaren, will do that!
Issue is fixed for me with version 0.2.31. Thank you!
I updated to 0.2.31 and with offload_kqv=False I still get gibberish. It looks like "▅体型做为畅销refresh LPона за junior后台飞行员сен sizeof", so a bit more creative than the ### we had before. The characters turned out as black boxes in my GUI, so I guess it still starts somewhat like that. Using nous-hermes-2-yi-34b.Q4_K_M.gguf as described above. It's the first reply in a fresh run, no state loading or similar.
@iactix Yup, this issue was partially driven by a bug in
It now works with that flag set to False. However, I fail to see the difference in behavior. I could swear I calibrated it to come out below 8 GB of VRAM usage in my old version (I updated for the state loading fix), but it now keeps coming out at 9.2 GB whether the flag is on or off. The speed seems the same too, and since reducing layers doesn't really make it faster, I still end up with about 1.2 tps instead of the 1.5 I had in an earlier version. Honestly, I don't know how much the KV cache should even affect VRAM usage at 4K context (but fresh generation after a sizeable system prompt), so it may actually be other things that affected the speed.
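For a rough sense of scale on the KV-cache question above: llama.cpp's f16 KV cache takes about 2 * n_layers * n_ctx * (n_kv_heads * head_dim) * 2 bytes. The sketch below uses assumed Yi-34B-style dimensions (they are not read from the actual GGUF in this thread) and lands at roughly 1 GiB at 4K context:

```python
# Back-of-the-envelope KV-cache estimate for an f16 cache (one K and one V
# tensor per layer). The model dimensions are assumptions for a Yi-34B-style
# model, not values taken from the GGUF metadata discussed here.
n_layers = 60        # assumed layer count
n_kv_heads = 8       # assumed grouped-query KV heads
head_dim = 128       # assumed per-head dimension
n_ctx = 4096         # context length from the comment above
bytes_per_elem = 2   # f16

kv_bytes = 2 * n_layers * n_ctx * (n_kv_heads * head_dim) * bytes_per_elem
print(f"{kv_bytes / 1024**3:.2f} GiB")  # ~0.94 GiB with these assumptions
```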
From commit 7c898d5 onwards, the output of any type of generation/completion on the GPU is just "#" repeated forever. For instance, using the example from README.md:
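The code block from the README did not survive extraction. For reference, the high-level completion example there looks roughly like this (the model path is a placeholder, and n_gpu_layers=-1 simply offloads all layers):

```python
from llama_cpp import Llama

# Load a GGUF model with all layers offloaded to the GPU (placeholder path).
llm = Llama(model_path="./models/7B/llama-model.gguf", n_gpu_layers=-1)

output = llm(
    "Q: Name the planets in the solar system? A: ",
    max_tokens=32,
    stop=["Q:", "\n"],
    echo=True,
)
print(output)
```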
The output is the following repeated:
Generation works fine on the CPU and for previous commits. It doesn't seem to be related to quantization or model type. Interestingly, generation also works using pure llama.cpp through the main interface for both CPU and GPU. I tested this out for the current master and the commits around the above change (notably 76484fb and 1d11838). I also managed to get it working in llama-cpp-python using the low level API, just using simple batching and llama_decode.
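The low-level snippet referenced above was not preserved. As a rough sketch of that path (simple batching plus llama_decode), assuming a 0.2.x-era binding and using placeholder model path and parameters:

```python
import llama_cpp

llama_cpp.llama_backend_init(False)  # numa=False; signature varies by version

# Load the model with all layers offloaded (placeholder path).
model_params = llama_cpp.llama_model_default_params()
model_params.n_gpu_layers = 99
model = llama_cpp.llama_load_model_from_file(b"./models/model.gguf", model_params)

ctx_params = llama_cpp.llama_context_default_params()
ctx_params.n_ctx = 2048
ctx = llama_cpp.llama_new_context_with_model(model, ctx_params)

# Tokenize the prompt into a fixed-size token buffer.
prompt = b"Q: Name the planets in the solar system? A: "
buf = (llama_cpp.llama_token * 64)()
n_tokens = llama_cpp.llama_tokenize(model, prompt, len(prompt), buf, 64, True, False)

# Put the whole prompt into one batch and evaluate it with a single decode call.
batch = llama_cpp.llama_batch_init(n_tokens, 0, 1)
batch.n_tokens = n_tokens
for i in range(n_tokens):
    batch.token[i] = buf[i]
    batch.pos[i] = i
    batch.n_seq_id[i] = 1
    batch.seq_id[i][0] = 0
    batch.logits[i] = False
batch.logits[n_tokens - 1] = True  # only need logits for the last position

assert llama_cpp.llama_decode(ctx, batch) == 0

# Greedy-pick the next token from the last position's logits.
logits = llama_cpp.llama_get_logits_ith(ctx, n_tokens - 1)
n_vocab = llama_cpp.llama_n_vocab(model)
next_token = max(range(n_vocab), key=lambda i: logits[i])

llama_cpp.llama_batch_free(batch)
llama_cpp.llama_free(ctx)
llama_cpp.llama_free_model(model)
llama_cpp.llama_backend_free()
```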
Environment info:
Thanks!