
using the negate character (^) in a grammar with a sequence #2888

Closed
pacmanincarnate opened this issue Aug 30, 2023 · 4 comments

pacmanincarnate commented Aug 30, 2023

In a grammar, the negate symbol does not work as anticipated with a sequence of characters (a string). Rather than negating the combination, it negates each letter individually. For instance [^"chapter"]+ will allow generation of any characters other than c, h, a, p, t, e, or r, rather than disallowing the characters that make up "chapter" in order. This makes it impossible to negate a specific word or phrase.
Ideally, we would be able to negate a sequence of characters rather than just a range or individual characters.
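
To make the difference concrete, here is a rough sketch (assuming the llama-cpp-python bindings; the grammars themselves are plain GBNF and could equally be passed to llama.cpp's --grammar flag):

    # Rough sketch, assuming llama-cpp-python is installed; it only parses two
    # grammars to contrast the behaviors described above.
    from llama_cpp import LlamaGrammar

    # [^...] is a character class: this matches one or more characters that are
    # not any of c, h, a, p, t, e, r. It does NOT mean "anything except the word
    # chapter". (Putting quotes inside the brackets just adds the quote
    # character itself to the set.)
    per_character_negation = LlamaGrammar.from_string('root ::= [^chapter]+')

    # A positive string literal, by contrast, matches the whole sequence in order.
    exact_sequence = LlamaGrammar.from_string('root ::= "chapter"')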

pacmanincarnate changed the title from "using the negate character (^) with a sequence" to "using the negate character (^) in a grammar with a sequence" on Aug 30, 2023

KerfuffleV2 (Collaborator) commented Sep 1, 2023

[blah] is specifically a character-set match; the same syntax works in regular expressions, for example. It would be really confusing if it (or its negation) worked differently, and changing this would break... basically everything.

Certainly it would be useful to be able to negate strings (I'm not familiar enough with the grammar stuff to know if that even exists already). I suspect it might be a hard thing to implement. You may or may not already know this, but LLMs generate tokens - one at a time - and tokens aren't words or letters, they're basically arbitrary chunks of text. Let's say you want to forbid "I like foxes" from being generated. This tokenizes like:

   306 -> ' I'
   763 -> ' like'
  1701 -> ' fo'
  9100 -> 'xes'
 29991 -> '!'

The LLM only generates one token at a time, and it doesn't know things like which tokens are going to be penalized in the future. You also obviously can't ban all tokens (or sequences) that lead to "I like foxes": banning "I" outright would be ridiculous, and the same goes for "like", etc. You can't even ban the sequence I, like, fo because the LLM could be trying to write something like "I like food".
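
(As an aside, if you want to check how a string tokenizes for your particular model, a sketch like this with the llama-cpp-python bindings works; the model path is a placeholder and the exact ids depend entirely on that model's vocabulary:)

    # Hedged sketch: inspect how a string tokenizes for a local GGUF model.
    # "./model.gguf" is a placeholder path; ids vary by model/tokenizer.
    from llama_cpp import Llama

    llm = Llama(model_path="./model.gguf", vocab_only=True)
    for tok in llm.tokenize(b" I like foxes!", add_bos=False):
        piece = llm.detokenize([tok]).decode("utf-8", errors="replace")
        print(f"{tok:>6} -> {piece!r}")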

So if you had this kind of sequence-forbidding function, what will happen is: the LLM will generate I, like, fo, and then the grammar sampling will set the probability of xes to -infinity so it can't be generated. Now the LLM has to pick something else. Maybe it'll try to write the thing using single-letter tokens, maybe it'll write something nonsensical because there are no other reasonable choices.
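
(Very roughly, that "-infinity" step looks something like this; the names are purely illustrative, not llama.cpp's actual sampler code:)

    import math

    # Toy illustration of constrained sampling: logits of tokens the grammar
    # cannot accept get pushed to -inf before softmax, so they can never be
    # sampled. Purely illustrative, not llama.cpp's implementation.
    def mask_logits(logits, allowed_ids):
        return [x if i in allowed_ids else -math.inf for i, x in enumerate(logits)]

    print(mask_logits([1.2, 0.3, 2.5], allowed_ids={0, 2}))  # [1.2, -inf, 2.5]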

So I think even if you could do this, you probably wouldn't want to. You'd generally get better results using the grammar to steer it toward generating what you want rather than forbidding stuff you don't, because really, you can only forbid the very last token in the sequence that would make it match your negative pattern. LLM says: "I can't generate xes? Okay, let's just say I like foxxes! There you go, buddy, it didn't match the pattern you said I couldn't write! Happy now?" Probably not.
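
(As a sketch of "steer it toward what you want": a positive grammar like the one below constrains the shape of the output instead of trying to forbid a phrase; the model path and prompt here are placeholders:)

    # Hedged sketch: constrain generation to a small positive pattern rather
    # than trying to ban a phrase. Model path and prompt are placeholders.
    from llama_cpp import Llama, LlamaGrammar

    grammar = LlamaGrammar.from_string('root ::= ("Section" | "Part") " " [0-9]+')
    llm = Llama(model_path="./model.gguf")
    out = llm("Write a heading:", grammar=grammar, max_tokens=8)
    print(out["choices"][0]["text"])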

pacmanincarnate (Author) commented

That's a good point and something I had thought of. I was hoping there might be a way, since you can specify a string in the positive. Do you know, by chance, if you can negate a token, rather than a character? I've never seen this implemented in a grammar, but since a lot of words are single tokens, that might be a reasonable way to negate many words.

Thanks for the response.

KerfuffleV2 (Collaborator) commented

No problem.

> Do you know, by chance, if you can negate a token, rather than a character?

As far as I know, that doesn't currently exist. I haven't used the grammar stuff personally but I've never seen anything about that. I don't think it's something that would really be too hard to add but I think it might be too hard to use in a practical way.

You'd have to know how stuff tokenizes before creating the grammar, and it also would probably only work with models in the same family. For example, a Falcon model might tokenize things differently from a LLaMA-2 model, so you'd have to know those details when writing the grammar.


Another approach to dealing with unwanted sequences of tokens is to rewind the history and ban the start of the part you don't want to see. I've been experimenting with that in my sequence repetition sampler project and it seems to work pretty well. This is a potential way the grammar stuff could handle what you were talking about originally, but of course it doesn't currently.

It's a fairly complicated thing to do (I actually don't even know how to make the grammar compatible with my rewinds), and there's also a performance cost to rewinding since you have to regenerate tokens.
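
(For the curious, here's the rewind-and-ban idea in toy form; this is not the actual sequence repetition sampler, just a mock with a stand-in random "sampler" to show the control flow:)

    import random

    # Toy mock of the rewind-and-ban idea described above -- NOT the real
    # sampler. A stand-in "sampler" picks random tokens from a tiny vocab;
    # whenever the banned text appears, we rewind to where it started and
    # forbid the token that began it at that position.
    random.seed(0)
    VOCAB = {1: " I", 2: " like", 3: " fo", 4: "xes", 5: "od", 6: "!"}
    BANNED = " I like foxes"

    def sample_next(banned_ids):
        return random.choice([t for t in VOCAB if t not in banned_ids])

    tokens, bans = [], {}  # bans: position -> set of forbidden token ids
    for _ in range(50):
        if len(tokens) >= 8:
            break
        tokens.append(sample_next(bans.get(len(tokens), set())))
        text = "".join(VOCAB[t] for t in tokens)
        idx = text.find(BANNED)
        if idx != -1:
            # Map the character offset back to the token where the match began,
            # then rewind to that position and ban that token there.
            start, off = 0, 0
            while off < idx:
                off += len(VOCAB[tokens[start]])
                start += 1
            bans.setdefault(start, set()).add(tokens[start])
            tokens = tokens[:start]

    print("".join(VOCAB[t] for t in tokens))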

github-actions bot added the stale label on Mar 25, 2024

github-actions bot commented Apr 9, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.
