Make the permanent prompt permanent #1019
No. The LLaMA model doesn't need to see the tokens themselves; the only necessary parameter is n_past. If there is something I would improve in the code, it is to keep around a representation of the exact context that the model has at the moment.
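As an illustration of that idea (not code from the repository), one could mirror exactly what has been evaluated; the names model_context, on_eval and on_swap below are made up, and the hooks correspond to the places where main.cpp evaluates embd and performs the context swap:

```cpp
#include <vector>
#include "llama.h"   // for llama_token

// Hypothetical mirror of the tokens the model has actually evaluated,
// i.e. the exact context backing the KV cache.
std::vector<llama_token> model_context;

// Call just before evaluating embd; after the eval, model_context.size() == n_past.
void on_eval(const std::vector<llama_token> & embd) {
    model_context.insert(model_context.end(), embd.begin(), embd.end());
}

// Call when the swap resets n_past to params.n_keep: everything except the
// kept prefix is dropped, and the re-inserted tokens come back via on_eval.
void on_swap(int n_keep) {
    model_context.resize(n_keep);
}
```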
last_n_tokens is not the actual context.

I understand that. Is there a way to see the actual context? Is that what you would like to be able to see?

n_past is the number of tokens reused from the past tokens (i.e. the context). Is it n_past tokens starting from the end or the beginning of the context? I don't understand where the context is being truncated following the line if ((n_past + (int) embd.size() > n_ctx)). Thanks a ton for your help.
It is the line: n_past = params.n_keep; That is it. That is all the model needs to know. The model will now calculate as if only the first n_past (= params.n_keep) tokens are in the context.
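For reference, this is roughly how the main example fed tokens to the model with the C API of that era (a sketch, assuming the old llama_eval interface, which newer llama.cpp versions have replaced with llama_decode):

```cpp
// embd holds only the tokens to be evaluated this step; n_past tells the model
// how many earlier positions are already in the KV cache and can be reused.
if (llama_eval(ctx, embd.data(), (int) embd.size(), n_past, params.n_threads) != 0) {
    fprintf(stderr, "llama_eval failed\n");
    return 1;
}
n_past += embd.size();

// After the swap sets n_past = params.n_keep, the KV entries for the first
// n_keep positions remain valid, so the permanent prompt is not re-evaluated;
// only the tokens re-inserted into embd (about n_left/2 of them) are computed again.
```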
F16_KV appears to have been removed here: ggml-org@af99c6f

This addresses two issues:
- ggml-org#995, which just requests to add the KV cache offloading param
- ggml-org#1006, a NULL ptr exception when using the embeddings (introduced by leaving f16_kv in the fields struct)
Expected Behavior
n_keep tokens (the params.prompt, e.g. alpaca.txt) are always part of the context and do not need to be recalculated.
Current Behavior
```cpp
// The prompt is tokenized once at startup; here n_keep is set to the full prompt length:
auto embd_inp = ::llama_tokenize(ctx, params.prompt, true);
params.n_keep = (int)embd_inp.size();

// When the context fills up, the swap resets n_past to the kept prefix and
// re-inserts the last n_left/2 tokens (n_left = n_past - params.n_keep in main.cpp)
// into embd so they get evaluated again:
n_past = params.n_keep;
embd.insert(embd.begin(), last_n_tokens.begin() + n_ctx - n_left/2 - embd.size(), last_n_tokens.end() - embd.size());

// After each evaluation, the evaluated tokens become part of the cached context:
n_past += embd.size();
```
Are my statements correct?
Suggestions:
To solve that, we could:
```cpp
// Shift the re-inserted window back by n_keep tokens to leave room for the kept prefix:
embd.insert(embd.begin(), last_n_tokens.begin() + n_ctx - n_left/2 - embd.size() - n_keep, last_n_tokens.end() - embd.size());
// Then prepend the first n_keep tokens of last_n_tokens to the front of embd:
embd.insert(embd.begin(), last_n_tokens.begin(), last_n_tokens.begin() + n_keep);
```
Is this right?
Problem: this would basically recompute the permanent prompt (e.g. alpaca.txt) every time the context reaches the max size.
Why is this a problem? I run a model where the permanent prompt is 1000 tokens (a multi-shot prompt) and the questions are 250 tokens, so recomputing the permanent prompt every time is painful.
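To put rough numbers on it (assuming an illustrative n_ctx = 2048, which is not stated in the issue): with n_keep = 1000, a swap happens when n_past is close to 2048, so n_left = n_past - n_keep ≈ 1048 and the swap re-feeds n_left/2 ≈ 524 recent tokens. Because n_past is reset to n_keep rather than 0, the 1000 prompt tokens stay in the KV cache and cost nothing; prepending them to embd as suggested above would instead re-evaluate roughly 1000 + 524 ≈ 1524 tokens per swap, about three times the work.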
Question: How do we recover / save the computation of the permanent prompt and then bring it back when the context is full?
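One possible direction (a sketch only, not the project's adopted solution): snapshot the context state right after the permanent prompt has been evaluated, and restore it instead of re-evaluating. This assumes the llama.cpp state API (llama_get_state_size, llama_copy_state_data, llama_set_state_data), which may postdate the build discussed in this issue; save_prompt_state and restore_prompt_state are made-up helper names:

```cpp
#include <cstdint>
#include <vector>
#include "llama.h"

std::vector<uint8_t> prompt_state;  // serialized context state captured after the prompt

// Call once, right after the permanent prompt has been evaluated.
void save_prompt_state(llama_context * ctx) {
    prompt_state.resize(llama_get_state_size(ctx));
    llama_copy_state_data(ctx, prompt_state.data());
}

// Call when the context is full: restore the cached prompt state and continue
// as if only the first n_keep tokens had been evaluated.
void restore_prompt_state(llama_context * ctx, int & n_past, int n_keep) {
    llama_set_state_data(ctx, prompt_state.data());
    n_past = n_keep;
}
```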