
examples/server.cpp: unexpected behavior when prompt of completion request is the same or contained in prior request #1726


Closed
4 tasks done
WangHaoranRobin opened this issue Jun 7, 2023 · 2 comments

Comments

@WangHaoranRobin
Contributor

Prerequisites

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

When the prompt of the current completion request is the same as, or contained in, the prior request's prompt (whose tokens are now included in processed_tokens), i.e.

current_prompt_tokens: "[tkn1][tkn2][tkn3]"
last_prompt_tokens: "[tkn1][tkn2][tkn3]" or  "[tkn1][tkn2][tkn3][tkn4]"

The expected behavior is that the server outputs tokens based on the correct current prompt "[tkn1][tkn2][tkn3]".

Current Behavior

Currently, if the last completion request used the prompt "[tkn1][tkn2][tkn3][tkn4]" and its predicted tokens were "[pred_tkn1][pred_tkn2][pred_tkn3]", the server will return output using "[tkn1][tkn2][tkn3][tkn4][pred_tkn1][pred_tkn2][pred_tkn3]" as the prompt.
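
For context, here is a minimal sketch of the prefix-reuse logic the report is about. The variable names (prompt_tokens, processed_tokens, n_past) follow the issue; the loop body itself is an assumption about the shape of the code in examples/server.cpp, not a copy of it.

// Hypothetical reconstruction of the prompt-loading loop described above;
// not the actual server.cpp source. processed_tokens still holds the prior
// request's prompt plus the tokens that were predicted for it.
size_t n_past = 0;
for (size_t i = 0; i < prompt_tokens.size(); i++) {
    if (i < processed_tokens.size() && processed_tokens[i] == prompt_tokens[i]) {
        n_past++;                   // token was already evaluated last time: reuse it
    } else {
        processed_tokens.resize(i); // first mismatch: drop the stale tail
        break;                      // and re-evaluate the prompt from position i
    }
}
// When prompt_tokens is fully contained in processed_tokens, the else branch is
// never taken, so the old predictions stay in the context and generation simply
// continues after "[tkn1][tkn2][tkn3][tkn4][pred_tkn1][pred_tkn2][pred_tkn3]".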

For example:

Request 1 prompt:

Text transcript of a never ending dialog, where User: interacts with an AI assistant named ChatBot.
ChatBot is helpful, kind, honest, friendly, good at writing and never fails to answer User:'s requests immediately and with details and precision.
There are no annotations like (30 seconds passed...) or (to himself), just what User: and ChatBot say aloud to each other.
The dialog lasts for years, the entirety of it is shared below. It's 10000 pages long.
The transcript only includes text, it does not include markup like HTML and Markdown.
...
User: What time is it?
ChatBot: It is 3:14 PM.
User: Who is the first president of the United States?
ChatBot: George Washington.
User: What is the capital of the United States?
ChatBot:

Request 1 result:

 The U.S. capital city has many names, including Washington D.C., D.C., and Washington.
The first two names were used by

Request 2 prompt:

[exactly the same as request 1 prompt]

Request 2 result:

the original architect L'Enfant to distinguish his work from that of other American cities.
User: Who lives in a pineapple under the sea

Environment and Context

Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except for under certain specific conditions.

  • Physical (or virtual) hardware you are using, e.g. for Linux:
$ sysctl -a | grep machdep.cpu
machdep.cpu.cores_per_package: 12
machdep.cpu.core_count: 12
machdep.cpu.logical_per_package: 12
machdep.cpu.thread_count: 12
machdep.cpu.brand_string: Apple M2 Max
  • Operating System, e.g. for Linux:
$ uname -a
Darwin MacBook-Pro.local 22.6.0 Darwin Kernel Version 22.6.0: Mon May 22 20:21:10 PDT 2023; root:xnu-8796.140.17.505.3~3/RELEASE_ARM64_T6020 arm64
  • SDK version, e.g. for Linux:
$ python3 --version
Python 3.8.16
$ make --version
GNU Make 3.81
$ g++ --version
Apple clang version 14.0.3 (clang-1403.0.22.14.1)
Target: arm64-apple-darwin22.6.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

Steps to Reproduce

Please provide detailed steps for reproducing the issue. We are not sitting in front of your screen, so the more detail the better.

  1. Pull the latest code and have the weights of a model ready
  2. make server
  3. ./server -m models/13B/ggml-model-q4_0.bin --ctx_size 2048
  4. Follow this guide to set up a client
  5. Set n_predict: 32, or any other small value
  6. Run the client with prompt:
Text transcript of a never ending dialog, where User: interacts with an AI assistant named ChatBot.
ChatBot is helpful, kind, honest, friendly, good at writing and never fails to answer User:'s requests immediately and with details and precision.
There are no annotations like (30 seconds passed...) or (to himself), just what User: and ChatBot say aloud to each other.
The dialog lasts for years, the entirety of it is shared below. It's 10000 pages long.
The transcript only includes text, it does not include markup like HTML and Markdown.

User: Hello, ChatBot!
ChatBot: Hello User:! How may I help you today?
User: What year is it?
ChatBot: We are in 2023.
User: Please tell me the largest city in Europe.
ChatBot: The largest city in Europe is Moscow, the capital of Russia.
User: What can you tell me about Moscow?
ChatBot: Moscow, on the Moskva River in western Russia, is the nation's cosmopolitan capital. In its historic core is the Kremlin, a complex that's home to the president and tsarist treasures in the Armoury. Outside its walls is Red Square, Russia’s symbolic center.
User: What is a cat?
ChatBot: A cat is a domestic species of small carnivorous mammal. It is the only domesticated species in the family Felidae.
User: How do I pass command line arguments to a Node.js program?
ChatBot: The arguments are stored in process.argv.

    argv[0] is the path to the Node. js executable.
    argv[1] is the path to the script file.
    argv[2] is the first argument passed to the script.
    argv[3] is the second argument passed to the script and so on.
User: Name a color.
ChatBot: Blue.
User: What time is it?
ChatBot: It is 3:14 PM.
User: Who is the first president of the United States?
ChatBot: George Washington.
User: What is the capital of the United States?
ChatBot:
  7. Run the same prompt again

Working Solution

In the prompt-loading logic in server.cpp, I changed this if statement from
if (i < processed_tokens.size() && processed_tokens[i] == prompt_tokens[i])
to
if (i < processed_tokens.size() && processed_tokens[i] == prompt_tokens[i] && i < prompt_tokens.size() - 1)

So that when processed_tokens completely contains prompt_tokens, the program still falls through to the else block and re-evaluates the prompt from the last token of prompt_tokens, instead of keeping processed_tokens and n_past unchanged.
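
With that change applied, the sketch from above would look roughly like this (again a reconstruction under the same assumptions, not the exact server.cpp source):

size_t n_past = 0;
for (size_t i = 0; i < prompt_tokens.size(); i++) {
    // extra condition: never reuse the final prompt token, so the loop always
    // reaches the else branch at least once and the stale tail is discarded
    if (i < processed_tokens.size() && processed_tokens[i] == prompt_tokens[i]
            && i < prompt_tokens.size() - 1) {
        n_past++;
    } else {
        processed_tokens.resize(i);
        break;
    }
}

Forcing at least one token through the else branch also means the model is evaluated on the final prompt token again, so sampling for the new request starts from fresh logits.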

But I've not done thorough testing yet and am fairly new to the code base. Could this be a potential PR?

@SlyEcho
Collaborator

SlyEcho commented Jun 7, 2023

The server is going to be changed in #1570

This issue is fixed there.

@WangHaoranRobin
Contributor Author

Thanks!
