using the negate character (^) in a grammar with a sequence #2888
Comments
Certainly it would be useful to be able to negate strings (I'm not familiar enough with the grammar stuff to know if that even exists already). I suspect it might be a hard thing to implement. You may or may not already know this, but LLMs generate tokens one at a time, and tokens aren't words or letters; they're basically arbitrary chunks of text. Let's say you want to forbid "I like foxes" from being generated: that string tokenizes into several such chunks, and the chunk boundaries don't necessarily line up with words.
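To make that concrete, here is a minimal sketch of greedy longest-match tokenization. The vocabulary below is made up purely for illustration; real tokenizers (BPE, SentencePiece) learn their pieces from data, so the actual split of "I like foxes" differs from model to model.

```python
# Hypothetical toy vocabulary -- not any real model's.
VOCAB = ["I", " like", " fox", "es", " ", "l", "i", "k", "e", "f", "o", "x", "s"]

def greedy_tokenize(text, vocab):
    """Greedy longest-match split, a rough stand-in for real BPE."""
    pieces = sorted(vocab, key=len, reverse=True)
    tokens, i = [], 0
    while i < len(text):
        for piece in pieces:
            if text.startswith(piece, i):
                tokens.append(piece)
                i += len(piece)
                break
        else:
            raise ValueError(f"no piece matches at position {i}")
    return tokens

print(greedy_tokenize("I like foxes", VOCAB))
# ['I', ' like', ' fox', 'es'] -- chunks, not words or letters
```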
The LLM only generates one token at a time, and it doesn't know what tokens are going to be penalized in the future. You also obviously can't ban every token (or sequence) that could lead to "I like foxes"; banning "I" would be ridiculous. So with this kind of sequence-forbidding function, the LLM will generate right up to the edge of the forbidden phrase before anything can stop it. I think even if you could do this, you probably wouldn't want to. You'd generally get better results using the grammar to steer the model toward generating what you want rather than forbidding what you don't, because really, you can only forbid the very last token in the sequence that would make it match your negative pattern. The LLM says "I like fox" and only then gets cut off.
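The "only the very last token can be forbidden" point can be sketched as a tiny sampler hook. The function name and token strings here are made up for illustration, not code from llama.cpp:

```python
def forbid_sequence(history, forbidden):
    """Tokens to ban on the next step so `forbidden` never completes.
    Only the final token of the sequence can ever be banned: by the
    time the prefix matches, everything before it is already emitted."""
    if len(forbidden) == 1:
        return {forbidden[0]}            # single-token case: always ban
    prefix = forbidden[:-1]
    if history[-len(prefix):] == prefix:
        return {forbidden[-1]}
    return set()

# The model has already said "I", " like", " fox" -- too late to avoid
# the phrase showing up almost verbatim; we can only block the "es".
print(forbid_sequence(["I", " like", " fox"], ["I", " like", " fox", "es"]))
# {'es'}
```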
That's a good point, and something I had thought of. I was hoping there might be a way, since you can specify a string in the positive. Do you know, by chance, whether you can negate a token rather than a character? I've never seen this implemented in a grammar, but since a lot of words are single tokens, that might be a reasonable way to negate many words. Thanks for the response.
No problem.
As far as I know, that doesn't currently exist. I haven't used the grammar stuff personally, but I've never seen anything about it. I don't think it would be too hard to add, but it might be too hard to use in a practical way: you'd have to know how the text tokenizes before writing the grammar, and it would probably only work with models in the same family. A Falcon model might tokenize text differently from a LLaMA 2 model, so you'd have to know those details when writing the grammar.

Another approach to dealing with unwanted sequences of tokens is to rewind the history and ban the start of the part you don't want to see. I've been experimenting with that in my sequence repetition sampler project and it seems to work pretty well. This is a potential way the grammar machinery could handle what you were originally asking about, but of course it doesn't currently. It's also a fairly complicated thing to do (I actually don't even know how to make the grammar compatible with my rewinds), and there's a performance cost to rewinding since you have to regenerate tokens.
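As a rough sketch of that rewind idea (the function and data structures here are hypothetical, not the actual sampler's code): when the unwanted sequence shows up, truncate the history back to where it started and remember to ban its first token at that position on the next pass.

```python
def rewind_and_ban(history, forbidden, banned_at):
    """If `forbidden` appears in `history`, truncate back to where it
    began and record its first token as banned at that position, so
    regeneration is forced down a different path. `banned_at` maps
    position -> set of banned tokens."""
    n = len(forbidden)
    for start in range(len(history) - n + 1):
        if history[start:start + n] == forbidden:
            banned_at.setdefault(start, set()).add(forbidden[0])
            return history[:start]      # caller regenerates from here
    return history                      # nothing to rewind

banned_at = {}
out = rewind_and_ban(["The", " end", ".", "I", " like", " fox", "es"],
                     ["I", " like", " fox", "es"], banned_at)
print(out, banned_at)
# ['The', ' end', '.'] {3: {'I'}}
```

The cost is visible in the sketch: everything after the truncation point has to be generated again.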
This issue was closed because it has been inactive for 14 days since being marked as stale.
In a grammar, the negate symbol does not work as expected with a sequence of characters (a string). Rather than negating the string as a whole, it negates each character individually. For instance, [^"chapter"]+ allows generation of any character other than c, h, a, p, t, e, or r (and the quote character), rather than disallowing the literal string "chapter". This makes it impossible to negate a specific word or phrase.
Ideally, we would be able to negate a sequence of characters rather than just a range or individual characters.
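Regular expressions use the same character-class semantics, so the behavior is easy to demonstrate outside llama.cpp (the regex below stands in for the GBNF rule; it is not GBNF itself):

```python
import re

# [^chapter]+ excludes each listed character individually, like the
# GBNF class [^"chapter"]+ -- it does not exclude the string "chapter".
allowed = re.compile(r"^[^chapter]+$")

print(bool(allowed.match("dogs")))     # True: no excluded characters
print(bool(allowed.match("cat")))      # False: 'c', 'a', 't' each excluded
print(bool(allowed.match("chapter")))  # False, but for the wrong reason
```

Note that perfectly innocent words like "cat" are rejected, while the actual goal, rejecting only the sequence "chapter", is not expressible this way.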