Server slowing down with each request (requests are identical) #4201

Closed · 4 tasks done
ruped opened this issue Nov 24, 2023 · 9 comments
Labels: bug (Something isn't working), stale
Assignee: ggerganov


ruped commented Nov 24, 2023

Pre-Prerequisite

Thanks to all the contributors for all the great work on llama.cpp!

Prerequisites

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behaviour

  • Identical requests to the server should take similar amounts of time
    • I made 4 identical requests to the server, sequentially, with a 1-second pause after each.

Current Behaviour

  • Each request to the server takes longer to process.
  • Note in particular that the prompt eval time gets much slower.
  • Also: after the last request it says "slot 3 released (1155 tokens in cache)". Is that expected?
    • Why should there be data in the cache after the slot has been released and there are no active requests?

print_timings: prompt eval time =    1110.28 ms /  1134 tokens (    0.98 ms per token,  1021.37 tokens per second)
print_timings:        eval time =     561.83 ms /    20 runs   (   28.09 ms per token,    35.60 tokens per second)
print_timings:       total time =    1672.10 ms
slot 0 released (1155 tokens in cache)

slot 1 is processing [task id: 1]
slot 1 : kv cache rm - [0, end)
print_timings: prompt eval time =    1483.84 ms /  1134 tokens (    1.31 ms per token,   764.23 tokens per second)
print_timings:        eval time =     591.59 ms /    20 runs   (   29.58 ms per token,    33.81 tokens per second)
print_timings:       total time =    2075.43 ms
slot 1 released (1155 tokens in cache)

slot 2 is processing [task id: 2]
slot 2 : kv cache rm - [0, end)
print_timings: prompt eval time =    1764.20 ms /  1134 tokens (    1.56 ms per token,   642.78 tokens per second)
print_timings:        eval time =     618.07 ms /    20 runs   (   30.90 ms per token,    32.36 tokens per second)
print_timings:       total time =    2382.28 ms
slot 2 released (1155 tokens in cache)

slot 3 is processing [task id: 3]
slot 3 : kv cache rm - [0, end)
print_timings: prompt eval time =    2229.91 ms /  1134 tokens (    1.97 ms per token,   508.54 tokens per second)
print_timings:        eval time =     642.50 ms /    20 runs   (   32.12 ms per token,    31.13 tokens per second)
print_timings:       total time =    2872.41 ms
slot 3 released (1155 tokens in cache)

Environment and Context

  • Physical (or virtual) hardware you are using: Physical hardware, Nvidia GPU

  • Operating System: Linux

Failure Information (for bugs)

Please help provide information about the failure / bug.

Steps to Reproduce

  • Start the server:
    ./build/bin/server -m ./ggml-model-q4.bin -ngl 9999 --ctx-size 8000 --host 0.0.0.0 --port 7777 --cont-batching --parallel 4
  • Make requests to the server with curl, sequentially (one at a time); a sketch of such a loop is below.
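
A minimal sketch of the request loop, assuming the server's /completion endpoint; the placeholder stands in for the ~1134-token prompt from the timings above, and this is not the verbatim command from the report:

# Hypothetical reproduction loop: four identical requests,
# one at a time, with a 1-second pause after each.
for i in 1 2 3 4; do
  curl -s http://localhost:7777/completion \
    -H 'Content-Type: application/json' \
    -d '{"prompt": "<identical ~1134-token prompt>", "n_predict": 20}' \
    > /dev/null
  sleep 1
done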

Thanks!

FSSRepo (Collaborator) commented Nov 24, 2023

It's not something that only happens with the server example; it also occurs with the ./main example, so it appears to be an internal issue: some underlying function's performance regresses as the context is filled.
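
One hedged way to isolate the ./main side of this claim (my sketch, not a command from the thread): compare a short generation against a long one with the same flags used elsewhere here. The final timing summary averages per-token time over the whole run, so a clearly higher per-token eval time on the long run would suggest the slowdown tracks context fill.

# Short run: context stays mostly empty.
./build/bin/main -m ./ggml-model-q4.bin -ngl 9999 -c 8000 -n 100 -p "Once upon a time"

# Long run: context fills up; compare the per-token eval time in
# the final timing summary against the short run.
./build/bin/main -m ./ggml-model-q4.bin -ngl 9999 -c 8000 -n 2000 -p "Once upon a time"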

ruped (Author) commented Nov 27, 2023

Thanks for confirming @FSSRepo.

Should this be moved to bug-confirmed status, or is that not really necessary?

ggerganov added the bug label and removed the bug-unconfirmed label on Nov 27, 2023
ggerganov self-assigned this on Nov 27, 2023
ggerganov (Member) commented Nov 27, 2023

> It's not something that only happens with the server example; it also occurs with the ./main example, so it appears to be an internal issue: some underlying function's performance regresses as the context is filled.

If the cache is cleared correctly, it would not be slower after each request, so there seems to be a server-specific problem.
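
A hedged way to test that hypothesis (a sketch, not from the thread): rerun the same four sequential requests against a single-slot server. If the timings stay flat with --parallel 1 but degrade with --parallel 4, the accumulating state lives in the server's shared KV cache across slots rather than in request handling itself.

# Control run: identical flags to the report, except a single slot.
./build/bin/server -m ./ggml-model-q4.bin -ngl 9999 --ctx-size 8000 \
  --host 0.0.0.0 --port 7777 --parallel 1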

FSSRepo (Collaborator) commented Nov 27, 2023

Upon reviewing it carefully, it seems so, although I believe that only happens if the requests are launched at the same time in different slots.

Also, keep in mind that the reported token speed is an average measured from the start of the task to its end.

Sorry for editing your comment; I sometimes confuse "quote" with "edit" in the GitHub Android app.

LoopControl commented

I've also seen this issue with ./server when used with -cb and -np flags on my Linux system.

The first request runs at the full expected speed, but the following requests generate more slowly (with an identical prompt).
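
One way to quantify this from the client side (a sketch, assuming the same /completion endpoint as above): have curl print the wall time of each identical request.

# Hypothetical timing loop: prints per-request total time so the
# degradation is visible without reading the server logs.
for i in 1 2 3 4; do
  curl -s -o /dev/null -w "request $i: %{time_total}s\n" \
    http://localhost:7777/completion \
    -H 'Content-Type: application/json' \
    -d '{"prompt": "<identical prompt>", "n_predict": 20}'
  sleep 1
done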

github-actions bot added the stale label on Mar 19, 2024
github-actions bot commented Apr 3, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions bot closed this as completed on Apr 3, 2024
NeuralNotwerk commented

Was this fixed? I'm still having this issue.

ggerganov (Member) commented

I think it should be fixed. If you can reproduce it, please provide the curl command that you are using to test.


jwnz commented Apr 2, 2025

Did you ever figure out what was causing this, @ggerganov ?
