Undo generate X most recent tokens - technically feasible? #2946

KerfuffleV2 · 2023-09-01T01:19:55Z

KerfuffleV2
Sep 1, 2023
Collaborator

I know this feature doesn't exist currently. There's also a naive way to do this: save the whole state after every token, then restore to whatever point you want (or use a ring buffer that keeps only the last N saved states). That would use a lot of memory, and copying the state around would also be pretty slow.
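For reference, a minimal sketch of that naive ring-buffer approach. The `SnapshotUndo` class and the dict used as "state" are hypothetical stand-ins for illustration, not anything from llama.cpp:

```python
import copy
from collections import deque


class SnapshotUndo:
    """Keep the last `capacity` full-state snapshots (hypothetical helper)."""

    def __init__(self, capacity):
        # deque(maxlen=...) drops the oldest snapshot automatically when full.
        self._snapshots = deque(maxlen=capacity)

    def save(self, state):
        # The deep copy is the expensive part: the whole state is duplicated
        # after every token.
        self._snapshots.append(copy.deepcopy(state))

    def undo(self, steps=1):
        """Discard the newest `steps` snapshots and return the one before them."""
        if steps >= len(self._snapshots):
            raise ValueError("not enough history to undo that far")
        for _ in range(steps):
            self._snapshots.pop()
        return copy.deepcopy(self._snapshots[-1])
```

Memory use grows as capacity times the state size, which is the cost described above.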

Is there a better way? Let's assume there's nothing weird going on like wrapping contexts, RoPE scaling, non-LLaMA models, etc. Can I look at the state (I think it's just KV states?) like:

| | | | | | |
 ^

at the beginning and then

|X|X|X| | | |
       ^

after 3 tokens have been generated. Then if I want to erase the previous token and try to regenerate it, I can zero out that "slot" in the state, move the position back a token and try again (possibly after setting whatever the logit for the previously generated token was to -inf). Is something like that possible in theory at least?
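A rough sketch of the "retry with the old token banned" part, assuming greedy sampling over a plain list of logits. `pick_greedy` and the example logits are made up for illustration. The logits used for the retry have to be the ones that produced the rejected token, so they would need to be either kept around or recomputed:

```python
import math


def pick_greedy(logits, banned=()):
    """Return the index of the largest logit, ignoring any banned token ids."""
    masked = list(logits)  # copy so the caller's logits stay untouched
    for tok in banned:
        masked[tok] = -math.inf  # a banned token can never be the maximum
    return max(range(len(masked)), key=lambda i: masked[i])


# Made-up logits for a 4-token vocabulary.
logits = [0.1, 2.5, 1.7, -0.3]
first_choice = pick_greedy(logits)                          # token 1
retry_choice = pick_greedy(logits, banned=[first_choice])   # token 2
```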

Answered by ggerganov

ggerganov · 2023-09-01T14:08:28Z

ggerganov
Sep 1, 2023
Maintainer

The n_past variable controls how much KV cache the llama_eval uses - i.e. it is the index. You can decrease it to "forget" the last token
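To illustrate the idea, here is a toy model of a cache indexed by n_past. `ToyKVCache` is a hypothetical stand-in, not llama.cpp's data structure or API; it only shows that "forgetting" is an index change and that stale entries get overwritten by the next evaluation:

```python
class ToyKVCache:
    """Toy cache: one slot per position, only slots below n_past are in use."""

    def __init__(self, n_ctx):
        self.slots = [None] * n_ctx
        self.n_past = 0

    def eval(self, tokens):
        # Each evaluated token writes its entry at the next free position.
        for tok in tokens:
            self.slots[self.n_past] = (tok, self.n_past)
            self.n_past += 1

    def forget_last(self, n):
        # Nothing is erased; entries at index >= n_past are simply ignored
        # and will be overwritten by later evaluations.
        if n > self.n_past:
            raise ValueError("cannot forget more tokens than were evaluated")
        self.n_past -= n

    def visible(self):
        return self.slots[: self.n_past]


cache = ToyKVCache(n_ctx=6)
cache.eval([101, 102, 103])
cache.forget_last(1)   # "undo" token 103
cache.eval([999])      # the replacement lands in the same slot
```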

6 replies

ggerganov Sep 1, 2023
Maintainer

Yes, unless I misunderstand the goal, it is very easy to discard the cache. A recent example of utilizing this is the speculative PR

In the long term, we will need to extend the interface to not just discard the last tokens from the KV cache, but also to be able to keep or pick certain indices, or even shift the cache (needed for faster perplexity calculations). But to do that, we need to resolve #2060.

KerfuffleV2 Sep 1, 2023
Collaborator Author

Thanks for the answer! I think you understood what I was talking about.

> we will need to extend the interface to not just discard the last tokens from the KV cache, but also to be able to keep or pick certain indices.

Is it correct to think of the KV cache as a tensor (or tensors) in which each token index has its own chunk of memory? So, for example, if I wanted to delete a token in the middle, I'd calculate which range in the KV cache was associated with that index and then basically memcpy the cache entries for the later token indexes over it.

|1|2|3|4|5| |
           ^

If I wanted to delete 3 here:

|1|2|4|5| | |
         ^

and this would be the same as if the model had generated |1|2|4|5| originally (you'd have to run an evaluation after the edit to get the logits). Can I think about it like that?

ggerganov Sep 3, 2023
Maintainer

It's almost like this. The extra thing you have to take into account for this to work is that each token's data in the KV cache has already been RoPEd, i.e. the information about the position at which the token was cached is embedded in the data. So if you simply move the cache data for that token to another position, you will get wrong results.

Discarding tokens from the end of the cache by decreasing n_past is fine, because no tokens in the cache change their position.
But for any other operation, we would need to account for that (#2060)
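A small numeric sketch of why moving RoPEd data breaks, using a single 2-D rotation pair. `rope`, `close`, and the chosen values are illustrative assumptions, not llama.cpp code. It only demonstrates the position-encoding issue and does not claim that re-rotating alone makes an edited cache identical to a fresh evaluation:

```python
import math


def rope(vec, pos, theta=1.0):
    """Rotate a 2-D vector by pos * theta (one RoPE frequency pair)."""
    x, y = vec
    a = pos * theta
    return (x * math.cos(a) - y * math.sin(a),
            x * math.sin(a) + y * math.cos(a))


def close(u, v, eps=1e-9):
    return all(abs(a - b) < eps for a, b in zip(u, v))


k = (1.0, 0.5)              # un-rotated key for some token
cached_at_3 = rope(k, 3)    # what the cache holds after evaluation at position 3
wanted_at_2 = rope(k, 2)    # what it would hold had the token been at position 2

moved_as_is = cached_at_3            # plain memcpy into slot 2: still encodes position 3
re_rotated = rope(cached_at_3, -1)   # rotations compose, so a shift by -1 lands on position 2

mismatch = not close(moved_as_is, wanted_at_2)   # True
fixed = close(re_rotated, wanted_at_2)           # True
```

Truncating from the end never hits this, because every remaining entry keeps the position it was rotated for.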

KerfuffleV2 Sep 5, 2023
Collaborator Author

I'm interested in writing a general interface for editing the history like this (insert, delete, undo, etc.), which I think would make #2060 really easy. Unfortunately, stuff like changing how the model is actually evaluated is not something I understand well enough to mess with at the moment. It seems like it would probably involve reversing the "roping K" part of #775. #3024 mentions storing non-RoPEd K, so maybe I can just wait a bit and let the problem magically solve itself.

Or perhaps there's someone that would be interested in collaborating and handling the model graph stuff that's required for the feature. If there's a relatively simple way to look at the state as chunks of memory that can be moved around I can handle the details of making an API to use it.

Thanks for taking the time to answer my questions. I know managing a project like this is pretty demanding (and not the only thing in your life to worry about). Definitely not something I take for granted and it's appreciated!

ggerganov Sep 7, 2023
Maintainer

Yup, we should probably solve #2060 first and then we can improve the KV cache management API to allow more complex operations.

ghost · 2023-09-02T13:10:39Z

ghost
Sep 2, 2023

I may be misunderstanding & I know server & main are significantly different, but is demo2 a server-based implementation of what you're describing?

3 replies

KerfuffleV2 Sep 2, 2023
Collaborator Author

> but is #2777 (comment) a server-based implementation of what you're describing?

I don't think so, but it's something that could possibly benefit from the kind of thing I'm talking about. (The server itself might be doing something similar internally, but the frontend wouldn't matter in that case.)

What I'm talking about is undoing generated tokens without having to reevaluate the prompt. The simple approach, of course, is to just drop the unwanted tokens and then run prompt evaluation on the whole thing but that's fairly slow.

ghost Sep 2, 2023

OK, so dropping the unwanted tokens and then running prompt eval is essentially an optimized version of exiting main and rerunning with -p "(wanted tokens)", right?

I think you want to delete/remove unwanted tokens, and skip the eval, yeah?

Edit: re: demo2, I was suggesting the server must have something similar going on internally, because the user sends an initial message, receives a denial, then edits it to an approval, and it generates seemingly without evaluating the user's initial message again.

KerfuffleV2 Sep 2, 2023
Collaborator Author

> is essentially an optimized version of exiting main and rerunning with -p "(wanted tokens)", right?

Basically, yes.

> and it generates seemingly without evaluating the user's initial message again.

I think it just looks that way because the prompt is pretty short. The pause before generation seems about the same for the original prompt eval compared to the edited one. I'm pretty sure it's just evaluating everything again. I skimmed the server source as well and I didn't see any logic to keep track of a prompt and avoid reevaluation. I only skimmed it though, so it's possible I missed something like that.
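For what it's worth, logic to avoid re-evaluating an edited prompt could look roughly like the sketch below: keep the token ids that are already cached, find the shared prefix with the new prompt, and only evaluate the rest. `reusable_prefix_len` and `plan_eval` are hypothetical names, and this is not taken from the server source:

```python
def reusable_prefix_len(cached_tokens, new_tokens):
    """Count how many leading tokens are identical in both sequences."""
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n


def plan_eval(cached_tokens, new_tokens):
    """Return (n_past, tokens_to_eval) for an edited prompt."""
    n_keep = reusable_prefix_len(cached_tokens, new_tokens)
    # Leave at least one token to evaluate so that fresh logits get produced.
    if n_keep == len(new_tokens) and n_keep > 0:
        n_keep -= 1
    return n_keep, new_tokens[n_keep:]


# The edit changes the prompt from the 4th token on, so 3 cached tokens are reusable.
n_past, todo = plan_eval([1, 2, 3, 4, 5], [1, 2, 3, 9, 8])
```

This only covers edits where everything before the change stays put. Anything that shifts later tokens runs into the position issue ggerganov described.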

Anyway, I already got a good answer to my original question. If the followup works the way I thought, I'm going to look at making a general interface for editing the tokens. I think being able to do that sort of stuff easily will open up a lot of possibilities.
