Server slowing down with each request (requests are identical) #4201

Closed · 4 tasks done
ruped opened this issue Nov 24, 2023 · 9 comments
Labels: bug (Something isn't working), stale
Assignee: ggerganov


ruped commented Nov 24, 2023

Pre-Prerequisite

Thanks to all the contributors for all the great work on llama.cpp!

Prerequisites

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behaviour

  • Identical requests to the server should take similar amounts of time
    • I made 4 identical requests to the server, sequentially, with a 1-second pause after each.

Current Behaviour

  • Each request to the server takes longer to process.
  • Note in particular that the prompt eval time gets much slower.
  • Also: after the last request it says "slot 3 released (1155 tokens in cache)". Is that expected?
    • Why should there be data in the cache after the slot has been released and there are no active requests?

print_timings: prompt eval time =    1110.28 ms /  1134 tokens (    0.98 ms per token,  1021.37 tokens per second)
print_timings:        eval time =     561.83 ms /    20 runs   (   28.09 ms per token,    35.60 tokens per second)
print_timings:       total time =    1672.10 ms
slot 0 released (1155 tokens in cache)

slot 1 is processing [task id: 1]
slot 1 : kv cache rm - [0, end)
print_timings: prompt eval time =    1483.84 ms /  1134 tokens (    1.31 ms per token,   764.23 tokens per second)
print_timings:        eval time =     591.59 ms /    20 runs   (   29.58 ms per token,    33.81 tokens per second)
print_timings:       total time =    2075.43 ms
slot 1 released (1155 tokens in cache)

slot 2 is processing [task id: 2]
slot 2 : kv cache rm - [0, end)
print_timings: prompt eval time =    1764.20 ms /  1134 tokens (    1.56 ms per token,   642.78 tokens per second)
print_timings:        eval time =     618.07 ms /    20 runs   (   30.90 ms per token,    32.36 tokens per second)
print_timings:       total time =    2382.28 ms
slot 2 released (1155 tokens in cache)

slot 3 is processing [task id: 3]
slot 3 : kv cache rm - [0, end)
print_timings: prompt eval time =    2229.91 ms /  1134 tokens (    1.97 ms per token,   508.54 tokens per second)
print_timings:        eval time =     642.50 ms /    20 runs   (   32.12 ms per token,    31.13 tokens per second)
print_timings:       total time =    2872.41 ms
slot 3 released (1155 tokens in cache)

Environment and Context

  • Physical (or virtual) hardware you are using: Physical hardware, Nvidia GPU

  • Operating System: Linux

Failure Information (for bugs)

Please help provide information about the failure / bug.

Steps to Reproduce

  • Start the server:
    ./build/bin/server -m ./ggml-model-q4.bin -ngl 9999 --ctx-size 8000 --host 0.0.0.0 --port 7777 --cont-batching --parallel 4
  • Make requests to the server with curl, sequentially (one at a time); a sketch of such a loop is below.
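
A minimal sketch of the request loop, assuming the server's /completion endpoint; the placeholder stands in for the ~1134-token prompt from the timings above, and this is not the verbatim command from the report:

# Hypothetical reproduction loop: four identical requests,
# one at a time, with a 1-second pause after each.
for i in 1 2 3 4; do
  curl -s http://localhost:7777/completion \
    -H 'Content-Type: application/json' \
    -d '{"prompt": "<identical ~1134-token prompt>", "n_predict": 20}' \
    > /dev/null
  sleep 1
done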

Thanks!

FSSRepo (Collaborator) commented Nov 24, 2023

It's not something that only happens with the server example; it also occurs with the ./main example, so it appears to be an internal issue: some underlying function's performance regresses as the context is filled.
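
One hedged way to isolate the ./main side of this claim (my sketch, not a command from the thread): compare a short generation against a long one with the same flags used elsewhere here. The final timing summary averages per-token time over the whole run, so a clearly higher per-token eval time on the long run would suggest the slowdown tracks context fill.

# Short run: context stays mostly empty.
./build/bin/main -m ./ggml-model-q4.bin -ngl 9999 -c 8000 -n 100 -p "Once upon a time"

# Long run: context fills up; compare the per-token eval time in
# the final timing summary against the short run.
./build/bin/main -m ./ggml-model-q4.bin -ngl 9999 -c 8000 -n 2000 -p "Once upon a time"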

ruped (Author) commented Nov 27, 2023

Thanks for confirming @FSSRepo.

Should this be moved to bug-confirmed status, or is that not really necessary?

ggerganov added the bug label and removed the bug-unconfirmed label on Nov 27, 2023
ggerganov self-assigned this on Nov 27, 2023
ggerganov (Member) commented Nov 27, 2023

> It's not something that only happens with the server example; it also occurs with the ./main example, so it appears to be an internal issue: some underlying function's performance regresses as the context is filled.

If the cache is cleared correctly, it would not be slower after each request, so there seems to be a server-specific problem.
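
A hedged way to test that hypothesis (a sketch, not from the thread): rerun the same four sequential requests against a single-slot server. If the timings stay flat with --parallel 1 but degrade with --parallel 4, the accumulating state lives in the server's shared KV cache across slots rather than in request handling itself.

# Control run: identical flags to the report, except a single slot.
./build/bin/server -m ./ggml-model-q4.bin -ngl 9999 --ctx-size 8000 \
  --host 0.0.0.0 --port 7777 --parallel 1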

FSSRepo (Collaborator) commented Nov 27, 2023

Upon reviewing it carefully, it seems so, although I believe that only happens if the requests are launched at the same time in different slots.

Also, keep in mind that the reported token speed is an average measured from the start of the task to its end.

Sorry for editing your comment; I sometimes confuse "quote" with "edit" in the GitHub Android app.

LoopControl commented

I've also seen this issue with ./server when used with -cb and -np flags on my Linux system.

The first request runs at the full expected speed, but the following requests generate more slowly (with an identical prompt).
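
One way to quantify this from the client side (a sketch, assuming the same /completion endpoint as above): have curl print the wall time of each identical request.

# Hypothetical timing loop: prints per-request total time so the
# degradation is visible without reading the server logs.
for i in 1 2 3 4; do
  curl -s -o /dev/null -w "request $i: %{time_total}s\n" \
    http://localhost:7777/completion \
    -H 'Content-Type: application/json' \
    -d '{"prompt": "<identical prompt>", "n_predict": 20}'
  sleep 1
done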

github-actions bot added the stale label on Mar 19, 2024
github-actions bot commented Apr 3, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions bot closed this as completed on Apr 3, 2024
NeuralNotwerk commented

Was this fixed? I'm still having this issue.

ggerganov (Member) commented

I think it should be fixed. If you can reproduce it, please provide the curl command that you are using to test.


jwnz commented Apr 2, 2025

Did you ever figure out what was causing this, @ggerganov ?
