Make the permanent prompt permanent #1019
No. The LLaMA model doesn't need to see the tokens themselves; the only necessary parameter is n_past. If there is something I would improve in the code, it is to keep around a representation of the exact context that the model has at the moment.
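As an illustration of that idea (not code from the repository), one could mirror exactly what has been evaluated; the names model_context, on_eval and on_swap below are made up, and the hooks correspond to the places where main.cpp evaluates embd and performs the context swap:

```cpp
#include <vector>
#include "llama.h"   // for llama_token

// Hypothetical mirror of the tokens the model has actually evaluated,
// i.e. the exact context backing the KV cache.
std::vector<llama_token> model_context;

// Call just before evaluating embd; after the eval, model_context.size() == n_past.
void on_eval(const std::vector<llama_token> & embd) {
    model_context.insert(model_context.end(), embd.begin(), embd.end());
}

// Call when the swap resets n_past to params.n_keep: everything except the
// kept prefix is dropped, and the re-inserted tokens come back via on_eval.
void on_swap(int n_keep) {
    model_context.resize(n_keep);
}
```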
last_n_tokens is not the actual context.

I understand that. Is there a way to see the actual context? Is that what you would like to be able to see?

n_past is the number of tokens reused from the past tokens (i.e. the context). Is it n_past tokens starting from the end or the beginning of the context? I don't understand where the context is being truncated following the line if ((n_past + (int) embd.size() > n_ctx)). Thanks a ton for your help.
It is the line: n_past = params.n_keep; That is it. That is all the model needs to know. The model will now calculate as if only the first n_past (= params.n_keep) tokens are in the context.
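For reference, this is roughly how the main example fed tokens to the model with the C API of that era (a sketch, assuming the old llama_eval interface, which newer llama.cpp versions have replaced with llama_decode):

```cpp
// embd holds only the tokens to be evaluated this step; n_past tells the model
// how many earlier positions are already in the KV cache and can be reused.
if (llama_eval(ctx, embd.data(), (int) embd.size(), n_past, params.n_threads) != 0) {
    fprintf(stderr, "llama_eval failed\n");
    return 1;
}
n_past += embd.size();

// After the swap sets n_past = params.n_keep, the KV entries for the first
// n_keep positions remain valid, so the permanent prompt is not re-evaluated;
// only the tokens re-inserted into embd (about n_left/2 of them) are computed again.
```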
F16_KV appears to have been removed here: ggml-org@af99c6f

This addresses two issues:
- ggml-org#995, which just requests to add the KV cache offloading param
- ggml-org#1006, a NULL ptr exception when using the embeddings (introduced by leaving f16_kv in the fields struct)
Expected Behavior
n_keep tokens (the params.prompt, e.g. alpaca.txt) are always part of the context and do not need to be recalculated.
Current Behavior
```cpp
// The prompt is tokenized once at startup; here n_keep is set to the full prompt length:
auto embd_inp = ::llama_tokenize(ctx, params.prompt, true);
params.n_keep = (int)embd_inp.size();

// When the context fills up, the swap resets n_past to the kept prefix and
// re-inserts the last n_left/2 tokens (n_left = n_past - params.n_keep in main.cpp)
// into embd so they get evaluated again:
n_past = params.n_keep;
embd.insert(embd.begin(), last_n_tokens.begin() + n_ctx - n_left/2 - embd.size(), last_n_tokens.end() - embd.size());

// After each evaluation, the evaluated tokens become part of the cached context:
n_past += embd.size();
```
Are my statements correct?
Suggestions:
To solve that, we could:
```cpp
// Shift the re-inserted window back by n_keep tokens to leave room for the kept prefix:
embd.insert(embd.begin(), last_n_tokens.begin() + n_ctx - n_left/2 - embd.size() - n_keep, last_n_tokens.end() - embd.size());
// Then prepend the first n_keep tokens of last_n_tokens to the front of embd:
embd.insert(embd.begin(), last_n_tokens.begin(), last_n_tokens.begin() + n_keep);
```
Is this right?
Problem: this would basically recompute the permanent prompt (e.g. alpaca.txt) every time the context reaches the max size.
Why is this a problem? I run a model where the permanent prompt is 1000 tokens (a multi-shot prompt) and the questions are 250 tokens, so recomputing the permanent prompt every time is painful.
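To put rough numbers on it (assuming an illustrative n_ctx = 2048, which is not stated in the issue): with n_keep = 1000, a swap happens when n_past is close to 2048, so n_left = n_past - n_keep ≈ 1048 and the swap re-feeds n_left/2 ≈ 524 recent tokens. Because n_past is reset to n_keep rather than 0, the 1000 prompt tokens stay in the KV cache and cost nothing; prepending them to embd as suggested above would instead re-evaluate roughly 1000 + 524 ≈ 1524 tokens per swap, about three times the work.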
Question: How do we recover / save the computation of the permanent prompt and then bring it back when the context is full?
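One possible direction (a sketch only, not the project's adopted solution): snapshot the context state right after the permanent prompt has been evaluated, and restore it instead of re-evaluating. This assumes the llama.cpp state API (llama_get_state_size, llama_copy_state_data, llama_set_state_data), which may postdate the build discussed in this issue; save_prompt_state and restore_prompt_state are made-up helper names:

```cpp
#include <cstdint>
#include <vector>
#include "llama.h"

std::vector<uint8_t> prompt_state;  // serialized context state captured after the prompt

// Call once, right after the permanent prompt has been evaluated.
void save_prompt_state(llama_context * ctx) {
    prompt_state.resize(llama_get_state_size(ctx));
    llama_copy_state_data(ctx, prompt_state.data());
}

// Call when the context is full: restore the cached prompt state and continue
// as if only the first n_keep tokens had been evaluated.
void restore_prompt_state(llama_context * ctx, int & n_past, int n_keep) {
    llama_set_state_data(ctx, prompt_state.data());
    n_past = n_keep;
}
```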