Server slowing down with each request (requests are identical) #4201
Comments
It's not something that only happens with the server example; it also occurs with the ./main example, so it's an internal issue: performance regresses as the context is filled.
Thanks for confirming @FSSRepo. Should this be moved to
If the cache is cleared correctly, it would not be slower after each request, so there seems to be a server-specific problem.
Upon reviewing carefully, it seems so, although I believe that only happens if the requests are launched at the same time in different slots. Also, remember that the token speed is a mean from the start of the task to the end of the task. Sorry for editing your comment; I sometimes confuse quote with edit in the GitHub Android app.
I've also seen this issue with ./server when used with
The first request runs at full expected speed, but the following requests generate more slowly (with an identical prompt).
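Since the report is about identical requests getting slower, a quick way to quantify it is to time each request from the client side. A minimal sketch, assuming a local ./server exposing the /completion endpoint (the port, prompt, and n_predict values here are placeholders, not from the report):

```python
import json
import time
import urllib.request


# Request body fields ("prompt", "n_predict") follow the llama.cpp server's
# /completion API; adjust them to match your build if they differ.
def build_payload(prompt: str, n_predict: int = 64) -> bytes:
    return json.dumps({"prompt": prompt, "n_predict": n_predict}).encode("utf-8")


# Wall-clock time for one completion request against the given URL.
def time_request(url: str, payload: bytes) -> float:
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        resp.read()
    return time.perf_counter() - start
```

Calling `time_request` in a loop with the same payload (e.g. five times against `http://127.0.0.1:8080/completion`) should print roughly constant times on a healthy server; steadily increasing times would reproduce what this thread describes.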
This issue was closed because it has been inactive for 14 days since being marked as stale.
Was this fixed? I'm still having this issue.
I think it should be fixed. If you reproduce it, please provide the
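When reporting back, the per-request timing data the server returns is the most useful thing to attach. A small sketch for pulling it out of a /completion response body; the `timings` field names (`prompt_ms`, `predicted_ms`) are assumptions based on recent llama.cpp server builds, so inspect your actual response JSON if they differ:

```python
import json


# Extract the server-reported timings from a /completion response body.
# The field names below are assumptions; check your response if absent.
def summarize_timings(response_text: str) -> str:
    timings = json.loads(response_text).get("timings", {})
    return (f"prompt: {timings.get('prompt_ms', '?')} ms, "
            f"generation: {timings.get('predicted_ms', '?')} ms")


# Fabricated example response, just to show the expected shape:
sample = json.dumps({"content": "...",
                     "timings": {"prompt_ms": 120.5, "predicted_ms": 890.0}})
print(summarize_timings(sample))  # prompt: 120.5 ms, generation: 890.0 ms
```

Logging one such summary line per request makes the slowdown (or its absence) easy to see at a glance in a bug report.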
Did you ever figure out what was causing this, @ggerganov?
Pre-Prerequisite
Thanks to all the contributors for all the great work on llama.cpp!
Prerequisites
Expected Behaviour
Current Behaviour
prompt_eval time gets much slower.
Environment and Context
Physical (or virtual) hardware you are using: Physical hardware, Nvidia GPU
Operating System: Linux
Failure Information (for bugs)
Please help provide information about the failure / bug.
Steps to Reproduce
Thanks!