Is there internal state caching for use with llama_eval's n_past? #1111
Unanswered
DannyDaemonic asked this question in Q&A
Replies: 2 comments 2 replies
-
Since things are changing pretty fast, I should mention this behavior predates
0 replies
-
I'm still trying to make sense of the code myself, so don't take this as an authoritative answer, but hopefully it can help you in your search. I found a KV cache in [...]. None of the things currently stored by [...]
2 replies
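The KV cache mentioned in the reply above is the likely mechanism. As a minimal sketch (this is a hypothetical toy model, not llama.cpp's actual data structures — `ToyKVCache` and `toy_eval` are invented names), the idea is that each evaluated position stores its key and value vectors, so a later evaluation with `n_past = N` reuses positions `0..N-1` untouched and only computes entries for the new tokens:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy model of a per-position KV cache (NOT llama.cpp's
// actual implementation). Each evaluated position keeps its key and
// value vectors; resuming at n_past = N truncates anything after N
// and appends only the new tokens.
struct ToyKVCache {
    std::vector<std::vector<float>> k; // one key vector per position
    std::vector<std::vector<float>> v; // one value vector per position
};

// Returns how many positions were actually (re)computed, mirroring the
// timing observation in the question: the cached prefix costs nothing.
std::size_t toy_eval(ToyKVCache &cache,
                     const std::vector<std::vector<float>> &new_kv,
                     std::size_t n_past) {
    cache.k.resize(n_past); // discard anything past the resume point
    cache.v.resize(n_past);
    for (const auto &kv : new_kv) {
        cache.k.push_back(kv); // only new tokens get computed and stored
        cache.v.push_back(kv);
    }
    return new_kv.size();
}
```

Evaluating a 10-token prompt and then a single token at `n_past = 10` does one token's worth of work on the second call, while the cache grows to 11 positions.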
-
It was always my understanding that transformer models have a fixed context window with a single internal state. When that window fills, one simply decides which tokens are important (taking the prompt, for example, along with the last half of the evaluated tokens), resets the entire transformer, and starts again from 0 with the chosen tokens in an attempt to maintain some form of continuity.
However, I've noticed that instead of reevaluating the initial tokens, we can just call `llama_eval` with a previous positional value for `n_past`, and it seems to resume right where we want it to. If the prompt is "Here's a funny joke:" and its length is 10 tokens, we can set `n_past` to 10 at any future evaluation point and it doesn't need to reevaluate those first 10 tokens. I even tested this by inferring the start of the joke, evaluating a single newline token with an `n_past` of 10, and continuing to infer from that point, and I got the start of a new joke. I timed it, and that call took only as long as any other single-token evaluation, so it's clearly not just reevaluating the initial prompt.

Am I missing something about how these transformer models work? Is this implementation doing something special to store and reuse intermediate states, or does it have some sort of cache for past Q, K, and V matrices that allows us to jump back to any previous point? I tried to step through the code, and while there were some places where `n_past` seemed to be used as an index, I didn't see any point where it compared `n_past` to some position or length variable to restore a previous state.
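The experiment described in the question can be simulated end to end. The stub below stands in for the real library so the sketch is self-contained — `llama_context_stub` and `llama_eval_stub` are invented names, and the real `llama_eval(ctx, tokens, n_tokens, n_past, n_threads)` signature is recalled from the API of that era and may differ from the current header:

```cpp
#include <cassert>
#include <vector>

// Stub standing in for a llama.cpp context, so the n_past experiment
// from the question can be traced without the real library. Positions
// below n_past are treated as already cached; only new tokens cost work.
struct llama_context_stub {
    int n_cached = 0; // positions whose K/V are already in the cache
    int computed = 0; // tokens actually evaluated on the last call
};

int llama_eval_stub(llama_context_stub &ctx,
                    const std::vector<int> &tokens, int n_past) {
    // Only the new tokens are computed; positions < n_past are reused.
    ctx.computed = static_cast<int>(tokens.size());
    ctx.n_cached = n_past + ctx.computed;
    return 0; // 0 = success, matching llama_eval's convention
}
```

Evaluating the 10-token prompt at `n_past = 0` and later a single newline token at `n_past = 10` computes exactly one token on the second call, which matches the timing observed in the question: no index comparison or explicit state restore is needed, because the cached K/V entries for positions 0–9 were never invalidated.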