Do not recreate context while LLama is writing #828
Comments
Seems like it's due to context swapping. The context limit of llama is 2048 tokens; after that, it does a "context swap":
https://github.com/ggerganov/llama.cpp/blob/master/examples/main/main.cpp#L256
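For context, the swap at that point roughly works like the sketch below (simplified and paraphrased, not the exact upstream code; names such as n_past, n_keep, embd, and the recent-token buffer follow the ones used in main.cpp). When the next batch would overflow the window, the first n_keep prompt tokens are kept, the oldest half of the rest is discarded, and the most recent tokens are queued for re-evaluation; that re-evaluation is the pause described in this issue.

```cpp
#include <cstdint>
#include <vector>

using llama_token = int32_t;

// Simplified paraphrase of the context swap in examples/main/main.cpp:
// keep the first n_keep prompt tokens, drop the oldest half of everything
// else, and queue the newest history tokens for re-evaluation. Re-evaluating
// that half of the window is what causes the long pause mid-generation.
static void swap_context_if_full(int & n_past, int n_ctx, int n_keep,
                                 std::vector<llama_token> & embd,              // tokens pending evaluation
                                 const std::vector<llama_token> & last_tokens) // ring of the last n_ctx tokens
{
    if (n_past + (int) embd.size() <= n_ctx) {
        return; // still fits in the context window
    }

    const int n_left = n_past - n_keep;

    // logically rewind the KV cache to the kept prompt prefix
    n_past = n_keep;

    // re-insert the newest n_left/2 history tokens in front of the pending
    // batch so they are evaluated again before generation continues
    embd.insert(embd.begin(),
                last_tokens.end() - n_left/2 - embd.size(),
                last_tokens.end() - embd.size());
}
```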
@ngxson This makes sense. This is invisible in, say, ChatGPT because there, the context recreation happens only after the model has finished writing, when it's the user's turn.
Would it make sense to track how full the context is in interactive mode, so that we could swap the context (or in this case clear a part of it) while the user is typing the next question?
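A minimal sketch of that idea (the 85% threshold and the helper name are illustrative assumptions, not values from llama.cpp): check the fill level of the context right before blocking on user input, and pay the re-evaluation cost there instead of in the middle of an answer.

```cpp
// Illustrative only: decide, just before reading the user's next line, whether
// the context is close enough to full that a swap should happen now. The 85%
// threshold is an assumption, not a value taken from llama.cpp.
static bool should_swap_while_user_types(int n_past, int n_ctx) {
    return (float) n_past >= 0.85f * (float) n_ctx;
}

// ...in the interactive loop, right before waiting for input, something like:
// if (should_swap_while_user_types(n_past, n_ctx)) {
//     // perform the same keep-n_keep / drop-half swap shown above, but
//     // without waiting for the context to actually overflow
// }
```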
It could also work like ChatGPT. There, the context is recreated every time the user sends a message: the tokens in the message are counted, the max response length is added to that, and then as much history as fits is prepended. Though I don't know how that would perform, since context recreation seems rather expensive. Here, I think a better solution would be to recreate the context as soon as LLama stops typing. We would assume that the user's query plus LLama's response must be no longer than a certain limit.
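A minimal sketch of the ChatGPT-style budgeting described above (hypothetical, not an existing llama.cpp API; count_tokens stands in for tokenizing with the model's vocabulary): reserve room for the new message and the maximum response length, then prepend as much prior history as still fits.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical illustration: budget the prompt so the new message plus the
// maximum response length always fit, then fill the remaining space with the
// most recent history, newest first.
static std::vector<std::string> build_prompt(
        const std::vector<std::string> & history,   // oldest to newest
        const std::string & new_message,
        int n_ctx,
        int max_response_tokens,
        const std::function<int(const std::string &)> & count_tokens)
{
    int budget = n_ctx - max_response_tokens - count_tokens(new_message);

    std::vector<std::string> prompt;
    // walk the history from newest to oldest, stopping once the budget is spent
    for (auto it = history.rbegin(); it != history.rend(); ++it) {
        const int cost = count_tokens(*it);
        if (cost > budget) {
            break;
        }
        budget -= cost;
        prompt.insert(prompt.begin(), *it);
    }
    prompt.push_back(new_message);
    return prompt;
}
```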
This issue was closed because it has been inactive for 14 days since being marked as stale.
Prerequisites
Please answer the following questions for yourself before submitting an issue.
Expected Behavior
Tokens are generated at an approximately constant rate, i.e., N tokens per second on a given machine.
Current Behavior
Sometimes, the LLM takes much longer to generate a token than usual. It can be a 10x slowdown.
Environment and Context
Setup
MacBook Pro 14-inch 2021
10-core Apple M1 Pro CPU
16 GB RAM
OS
macOS Ventura 13.3 (22E252)
clang --version
Steps to Reproduce
Run
./main
-m
./models/ggml-vicuna-7b-4bit-rev1.bin
-n
512
--color
-f
prompts/chat-with-vicuna.txt
--seed
42
--mlock
The model will get stuck after "of":

...or visit one of▏
the city's many restaurants...

Failure Logs
Video