
examples/server.cpp: unexpected behavior when prompt of completion request is the same or contained in prior request #1726


Closed
4 tasks done
WangHaoranRobin opened this issue Jun 7, 2023 · 2 comments

Comments

@WangHaoranRobin
Contributor

Prerequisites

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

When the prompt of the current completion request is the same as, or contained in, the prior request's prompt (whose tokens are now included in processed_tokens), i.e.

current_prompt_tokens: "[tkn1][tkn2][tkn3]"
last_prompt_tokens: "[tkn1][tkn2][tkn3]" or  "[tkn1][tkn2][tkn3][tkn4]"

The expected behavior is that the server outputs tokens based on the correct current prompt "[tkn1][tkn2][tkn3]".

Current Behavior

Currently, if the last completion request used the prompt "[tkn1][tkn2][tkn3][tkn4]" and its predicted tokens were "[pred_tkn1][pred_tkn2][pred_tkn3]", the server will return output using "[tkn1][tkn2][tkn3][tkn4][pred_tkn1][pred_tkn2][pred_tkn3]" as the prompt.
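
For context, here is a minimal sketch of the prefix-reuse logic the report is about. The variable names (prompt_tokens, processed_tokens, n_past) follow the issue; the loop body itself is an assumption about the shape of the code in examples/server.cpp, not a copy of it.

// Hypothetical reconstruction of the prompt-loading loop described above;
// not the actual server.cpp source. processed_tokens still holds the prior
// request's prompt plus the tokens that were predicted for it.
size_t n_past = 0;
for (size_t i = 0; i < prompt_tokens.size(); i++) {
    if (i < processed_tokens.size() && processed_tokens[i] == prompt_tokens[i]) {
        n_past++;                   // token was already evaluated last time: reuse it
    } else {
        processed_tokens.resize(i); // first mismatch: drop the stale tail
        break;                      // and re-evaluate the prompt from position i
    }
}
// When prompt_tokens is fully contained in processed_tokens, the else branch is
// never taken, so the old predictions stay in the context and generation simply
// continues after "[tkn1][tkn2][tkn3][tkn4][pred_tkn1][pred_tkn2][pred_tkn3]".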

For example:

Request 1 prompt:

Text transcript of a never ending dialog, where User: interacts with an AI assistant named ChatBot.
ChatBot is helpful, kind, honest, friendly, good at writing and never fails to answer User:'s requests immediately and with details and precision.
There are no annotations like (30 seconds passed...) or (to himself), just what User: and ChatBot say aloud to each other.
The dialog lasts for years, the entirety of it is shared below. It's 10000 pages long.
The transcript only includes text, it does not include markup like HTML and Markdown.
...
User: What time is it?
ChatBot: It is 3:14 PM.
User: Who is the first president of the United States?
ChatBot: George Washington.
User: What is the capital of the United States?
ChatBot:

Request 1 result:

 The U.S. capital city has many names, including Washington D.C., D.C., and Washington.
The first two names were used by

Request 2 prompt:

[exactly the same as request 1 prompt]

Request 2 result:

the original architect L'Enfant to distinguish his work from that of other American cities.
User: Who lives in a pineapple under the sea

Environment and Context

Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except for under certain specific conditions.

  • Physical (or virtual) hardware you are using, e.g. for Linux:
$ sysctl -a | grep machdep.cpu
machdep.cpu.cores_per_package: 12
machdep.cpu.core_count: 12
machdep.cpu.logical_per_package: 12
machdep.cpu.thread_count: 12
machdep.cpu.brand_string: Apple M2 Max
  • Operating System, e.g. for Linux:
$ uname -a
Darwin MacBook-Pro.local 22.6.0 Darwin Kernel Version 22.6.0: Mon May 22 20:21:10 PDT 2023; root:xnu-8796.140.17.505.3~3/RELEASE_ARM64_T6020 arm64
  • SDK version, e.g. for Linux:
$ python3 --version
Python 3.8.16
$ make --version
GNU Make 3.81
$ g++ --version
Apple clang version 14.0.3 (clang-1403.0.22.14.1)
Target: arm64-apple-darwin22.6.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

Steps to Reproduce

Please provide detailed steps for reproducing the issue. We are not sitting in front of your screen, so the more detail the better.

  1. Pull the latest code and have the weights of a model ready
  2. make server
  3. ./server -m models/13B/ggml-model-q4_0.bin --ctx_size 2048
  4. Follow this guide to set up a client
  5. Set n_predict: 32, or any other small value
  6. Run the client with prompt:
Text transcript of a never ending dialog, where User: interacts with an AI assistant named ChatBot.
ChatBot is helpful, kind, honest, friendly, good at writing and never fails to answer User:'s requests immediately and with details and precision.
There are no annotations like (30 seconds passed...) or (to himself), just what User: and ChatBot say aloud to each other.
The dialog lasts for years, the entirety of it is shared below. It's 10000 pages long.
The transcript only includes text, it does not include markup like HTML and Markdown.

User: Hello, ChatBot!
ChatBot: Hello User:! How may I help you today?
User: What year is it?
ChatBot: We are in 2023.
User: Please tell me the largest city in Europe.
ChatBot: The largest city in Europe is Moscow, the capital of Russia.
User: What can you tell me about Moscow?
ChatBot: Moscow, on the Moskva River in western Russia, is the nation's cosmopolitan capital. In its historic core is the Kremlin, a complex that's home to the president and tsarist treasures in the Armoury. Outside its walls is Red Square, Russia’s symbolic center.
User: What is a cat?
ChatBot: A cat is a domestic species of small carnivorous mammal. It is the only domesticated species in the family Felidae.
User: How do I pass command line arguments to a Node.js program?
ChatBot: The arguments are stored in process.argv.

    argv[0] is the path to the Node. js executable.
    argv[1] is the path to the script file.
    argv[2] is the first argument passed to the script.
    argv[3] is the second argument passed to the script and so on.
User: Name a color.
ChatBot: Blue.
User: What time is it?
ChatBot: It is 3:14 PM.
User: Who is the first president of the United States?
ChatBot: George Washington.
User: What is the capital of the United States?
ChatBot:
  7. Run the same prompt again

Working Solution

In the prompt-loading logic in server.cpp, I changed this if statement from
if (i < processed_tokens.size() && processed_tokens[i] == prompt_tokens[i])
to
if (i < processed_tokens.size() && processed_tokens[i] == prompt_tokens[i] && i < prompt_tokens.size() - 1)

So that when processed_tokens completely contains prompt_tokens, the program still falls through to the else block and re-evaluates the prompt from the last token of prompt_tokens, instead of keeping processed_tokens and n_past unchanged.
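
With that change applied, the sketch from above would look roughly like this (again a reconstruction under the same assumptions, not the exact server.cpp source):

size_t n_past = 0;
for (size_t i = 0; i < prompt_tokens.size(); i++) {
    // extra condition: never reuse the final prompt token, so the loop always
    // reaches the else branch at least once and the stale tail is discarded
    if (i < processed_tokens.size() && processed_tokens[i] == prompt_tokens[i]
            && i < prompt_tokens.size() - 1) {
        n_past++;
    } else {
        processed_tokens.resize(i);
        break;
    }
}

Forcing at least one token through the else branch also means the model is evaluated on the final prompt token again, so sampling for the new request starts from fresh logits.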

But I've not done thorough testing yet and am fairly new to the code base. Could this be a potential PR?

@SlyEcho
Collaborator

SlyEcho commented Jun 7, 2023

The server is going to be changed in #1570

This issue is fixed there.

@WangHaoranRobin
Contributor Author

Thanks!
