Prerequisites
I reviewed the Discussions, and have a new bug or useful enhancement to share.
Expected Behavior
When the prompt of the current completion request is the same as, or contained in, the prior request's prompt (whose tokens are now included in processed_tokens), i.e.
current_prompt_tokens: "[tkn1][tkn2][tkn3]"
last_prompt_tokens: "[tkn1][tkn2][tkn3]" or "[tkn1][tkn2][tkn3][tkn4]"
The expected behavior is that the server outputs tokens based on the correct current prompt "[tkn1][tkn2][tkn3]"
Current Behavior
Currently, if the last completion request's prompt was, say, "[tkn1][tkn2][tkn3][tkn4]" and its predicted tokens were "[pred_tkn1][pred_tkn2][pred_tkn3]", the server will return output using "[tkn1][tkn2][tkn3][tkn4][pred_tkn1][pred_tkn2][pred_tkn3]" as the prompt (a simplified sketch of this logic follows the example below).
For example:
Request 1 prompt:
Text transcript of a never ending dialog, where User: interacts with an AI assistant named ChatBot.
ChatBot is helpful, kind, honest, friendly, good at writing and never fails to answer User:'s requests immediately and with details and precision.
There are no annotations like (30 seconds passed...) or (to himself), just what User: and ChatBot say aloud to each other.
The dialog lasts for years, the entirety of it is shared below. It's 10000 pages long.
The transcript only includes text, it does not include markup like HTML and Markdown.
...
User: What time is it?
ChatBot: It is 3:14 PM.
User: Who is the first president of the United States?
ChatBot: George Washington.
User: What is the capital of the United States?
ChatBot:
Request 1 result:
The U.S. capital city has many names, including Washington D.C., D.C., and Washington.
The first two names were used by
Request 2 prompt:
[exactly the same as request 1 prompt]
Request 2 result:
the original architect L'Enfant to distinguish his work from that of other American cities.
User: Who lives in a pineapple under the sea
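To illustrate the caching behavior described above, here is a minimal, self-contained sketch of the prefix-matching logic as I understand it (variable names follow this report, token values are made up; this is an illustration, not the actual server.cpp code):

#include <cstdio>
#include <vector>

// State that survives between completion requests.
std::vector<int> processed_tokens; // prompt tokens plus previously generated tokens
int n_past = 0;                    // number of tokens already in the model's context

void load_prompt(const std::vector<int> &prompt_tokens)
{
    for (size_t i = 0; i < prompt_tokens.size(); ++i) {
        if (i < processed_tokens.size() && processed_tokens[i] == prompt_tokens[i]) {
            continue; // token already evaluated in the previous request, keep it cached
        }
        // First mismatch: drop the stale tail and re-evaluate from here.
        processed_tokens.resize(i);
        n_past = (int) i;
        return;
    }
    // If the new prompt is fully contained in processed_tokens, the mismatch branch
    // never runs: processed_tokens and n_past keep their old values, and generation
    // continues after the previous request's predicted tokens.
}

int main()
{
    // Request 1: prompt [tkn1..tkn4], after which the model generated [pred_tkn1..pred_tkn3].
    processed_tokens = {1, 2, 3, 4, 101, 102, 103};
    n_past = (int) processed_tokens.size();

    // Request 2: the same prompt again.
    load_prompt({1, 2, 3, 4});
    printf("n_past = %d, cached tokens = %zu\n", n_past, processed_tokens.size());
    // Prints "n_past = 7, cached tokens = 7": the server effectively continues from
    // [tkn1..tkn4][pred_tkn1..pred_tkn3] instead of from the new prompt.
}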
Environment and Context
Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except for under certain specific conditions.
Physical (or virtual) hardware you are using, e.g. for macOS:
$ sysctl -a | grep machdep.cpu
machdep.cpu.cores_per_package: 12
machdep.cpu.core_count: 12
machdep.cpu.logical_per_package: 12
machdep.cpu.thread_count: 12
machdep.cpu.brand_string: Apple M2 Max
Operating System, e.g. for macOS:
$ uname -a
Darwin MacBook-Pro.local 22.6.0 Darwin Kernel Version 22.6.0: Mon May 22 20:21:10 PDT 2023; root:xnu-8796.140.17.505.3~3/RELEASE_ARM64_T6020 arm64
SDK version, e.g. for macOS:
$ python3 --version
Python 3.8.16
$ make --version
GNU Make 3.81
$ g++ --version
Apple clang version 14.0.3 (clang-1403.0.22.14.1)
Target: arm64-apple-darwin22.6.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
Steps to Reproduce
Please provide detailed steps for reproducing the issue. We are not sitting in front of your screen, so the more detail the better.
Pull the latest code and have the weights of a model ready
Build and start the server:
make server
./server -m models/13B/ggml-model-q4_0.bin --ctx_size 2048
Send a completion request with the prompt below and a small n_predict (e.g. n_predict: 32, or any other small value); an example request is shown after the prompt
Text transcript of a never ending dialog, where User: interacts with an AI assistant named ChatBot.
ChatBot is helpful, kind, honest, friendly, good at writing and never fails to answer User:'s requests immediately and with details and precision.
There are no annotations like (30 seconds passed...) or (to himself), just what User: and ChatBot say aloud to each other.
The dialog lasts for years, the entirety of it is shared below. It's 10000 pages long.
The transcript only includes text, it does not include markup like HTML and Markdown.
User: Hello, ChatBot!
ChatBot: Hello User:! How may I help you today?
User: What year is it?
ChatBot: We are in 2023.
User: Please tell me the largest city in Europe.
ChatBot: The largest city in Europe is Moscow, the capital of Russia.
User: What can you tell me about Moscow?
ChatBot: Moscow, on the Moskva River in western Russia, is the nation's cosmopolitan capital. In its historic core is the Kremlin, a complex that's home to the president and tsarist treasures in the Armoury. Outside its walls is Red Square, Russia’s symbolic center.
User: What is a cat?
ChatBot: A cat is a domestic species of small carnivorous mammal. It is the only domesticated species in the family Felidae.
User: How do I pass command line arguments to a Node.js program?
ChatBot: The arguments are stored in process.argv.
argv[0] is the path to the Node. js executable.
argv[1] is the path to the script file.
argv[2] is the first argument passed to the script.
argv[3] is the second argument passed to the script and so on.
User: Name a color.
ChatBot: Blue.
User: What time is it?
ChatBot: It is 3:14 PM.
User: Who is the first president of the United States?
ChatBot: George Washington.
User: What is the capital of the United States?
ChatBot:
Run the same prompt again
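Concretely, both requests can be sent like this (assuming the server's /completion endpoint and default port; replace the placeholder with the prompt above and adjust host/port to your setup):

curl --request POST --url http://localhost:8080/completion \
    --header "Content-Type: application/json" \
    --data '{"prompt": "<the prompt above>", "n_predict": 32}'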
Working Solution
In the load prompt logic in server.cpp, I changed this if statement from
if (i < processed_tokens.size() && processed_tokens[i] == prompt_tokens[i])
to
if (i < processed_tokens.size() && processed_tokens[i] == prompt_tokens[i] && i < prompt_tokens.size() - 1)
so that when processed_tokens completely contains prompt_tokens, the program still goes to the else block and re-evaluates the prompt from the last token of prompt_tokens, instead of keeping processed_tokens and n_past the same.
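Applied to the simplified sketch from the Current Behavior section, the extra condition forces the last prompt token into the mismatch branch (again only an illustration of the idea, not a patch against the actual server.cpp):

#include <cstdio>
#include <vector>

std::vector<int> processed_tokens = {1, 2, 3, 4, 101, 102, 103}; // state after request 1
int n_past = 7;

void load_prompt(const std::vector<int> &prompt_tokens)
{
    for (size_t i = 0; i < prompt_tokens.size(); ++i) {
        if (i < processed_tokens.size() && processed_tokens[i] == prompt_tokens[i] &&
            i < prompt_tokens.size() - 1) {
            continue; // cached token, keep it
        }
        // The last prompt token now always falls through to here, so the stale
        // predicted tokens are dropped and evaluation resumes from the end of
        // the current prompt.
        processed_tokens.resize(i);
        n_past = (int) i;
        return;
    }
}

int main()
{
    load_prompt({1, 2, 3, 4}); // request 2: the same prompt as request 1
    printf("n_past = %d, cached tokens = %zu\n", n_past, processed_tokens.size());
    // Prints "n_past = 3, cached tokens = 3": only [tkn1][tkn2][tkn3] is kept and
    // [tkn4] onward is re-evaluated, so request 2 is answered from the correct prompt.
}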
But I've not done thorough testing yet and am fairly new to the code base. Could this be a potential PR?