Is there internal state caching for use with llama_eval's n_past? #1111
Unanswered
DannyDaemonic asked this question in Q&A
Replies: 2 comments 2 replies
-
Since things are changing pretty fast, I should mention this behavior predates
0 replies
-
I'm still trying to make sense of the code myself, so don't take this as an authoritative answer, but hopefully it can help you in your search. I found a KV cache in [...]. None of the things currently stored by [...]
2 replies
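The KV cache mentioned in the reply above is the likely mechanism. As a minimal sketch (this is a hypothetical toy model, not llama.cpp's actual data structures — `ToyKVCache` and `toy_eval` are invented names), the idea is that each evaluated position stores its key and value vectors, so a later evaluation with `n_past = N` reuses positions `0..N-1` untouched and only computes entries for the new tokens:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy model of a per-position KV cache (NOT llama.cpp's
// actual implementation). Each evaluated position keeps its key and
// value vectors; resuming at n_past = N truncates anything after N
// and appends only the new tokens.
struct ToyKVCache {
    std::vector<std::vector<float>> k; // one key vector per position
    std::vector<std::vector<float>> v; // one value vector per position
};

// Returns how many positions were actually (re)computed, mirroring the
// timing observation in the question: the cached prefix costs nothing.
std::size_t toy_eval(ToyKVCache &cache,
                     const std::vector<std::vector<float>> &new_kv,
                     std::size_t n_past) {
    cache.k.resize(n_past); // discard anything past the resume point
    cache.v.resize(n_past);
    for (const auto &kv : new_kv) {
        cache.k.push_back(kv); // only new tokens get computed and stored
        cache.v.push_back(kv);
    }
    return new_kv.size();
}
```

Evaluating a 10-token prompt and then a single token at `n_past = 10` does one token's worth of work on the second call, while the cache grows to 11 positions.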
-
It was always my understanding that transformer models have a fixed context window with a single internal state. When that window fills, one simply decides which tokens are important (taking the prompt, for example, along with the last half of the evaluated tokens), resets the entire transformer, and starts again from 0 with the chosen tokens in an attempt to maintain some form of continuity.
However, I've noticed that instead of reevaluating the initial tokens, we can just call `llama_eval` with a previous positional value for `n_past`, and it seems to resume right where we want it to. If the prompt is "Here's a funny joke:" and its length is 10 tokens, we can set `n_past` to 10 at any future evaluation point and it doesn't need to reevaluate those first 10 tokens. I even tested this by inferring the start of the joke, evaluating a single newline token with an `n_past` of 10, and continuing to infer from that point, and I got the start of a new joke. I timed it, and that call took only as long as any other single-token evaluation, so it's clearly not just reevaluating the initial prompt.

Am I missing something about how these transformer models work? Is this implementation doing something special to store and reuse intermediate states, or does it have some sort of cache for past Q, K, and V matrices that allows us to jump back to any previous point? I tried to step through the code, and while there were some places where `n_past` seemed to be used as an index, I didn't see any point where it compared `n_past` to some position or length variable to restore a previous state.
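The experiment described in the question can be simulated end to end. The stub below stands in for the real library so the sketch is self-contained — `llama_context_stub` and `llama_eval_stub` are invented names, and the real `llama_eval(ctx, tokens, n_tokens, n_past, n_threads)` signature is recalled from the API of that era and may differ from the current header:

```cpp
#include <cassert>
#include <vector>

// Stub standing in for a llama.cpp context, so the n_past experiment
// from the question can be traced without the real library. Positions
// below n_past are treated as already cached; only new tokens cost work.
struct llama_context_stub {
    int n_cached = 0; // positions whose K/V are already in the cache
    int computed = 0; // tokens actually evaluated on the last call
};

int llama_eval_stub(llama_context_stub &ctx,
                    const std::vector<int> &tokens, int n_past) {
    // Only the new tokens are computed; positions < n_past are reused.
    ctx.computed = static_cast<int>(tokens.size());
    ctx.n_cached = n_past + ctx.computed;
    return 0; // 0 = success, matching llama_eval's convention
}
```

Evaluating the 10-token prompt at `n_past = 0` and later a single newline token at `n_past = 10` computes exactly one token on the second call, which matches the timing observed in the question: no index comparison or explicit state restore is needed, because the cached K/V entries for positions 0–9 were never invalidated.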