
using the negate character (^) in a grammar with a sequence #2888

Closed
pacmanincarnate opened this issue Aug 30, 2023 · 4 comments

pacmanincarnate commented Aug 30, 2023

In a grammar, the negate symbol does not work as anticipated with a sequence of characters (a string). Rather than negating the combination, it negates each letter individually. For instance [^"chapter"]+ will allow generation of any characters other than c, h, a, p, t, e, or r, rather than disallowing the characters that make up "chapter" in order. This makes it impossible to negate a specific word or phrase.
Ideally, we would be able to negate a sequence of characters rather than just a range or individual characters.
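
To make the difference concrete, here is a rough sketch (assuming the llama-cpp-python bindings; the grammars themselves are plain GBNF and could equally be passed to llama.cpp's --grammar flag):

    # Rough sketch, assuming llama-cpp-python is installed; it only parses two
    # grammars to contrast the behaviors described above.
    from llama_cpp import LlamaGrammar

    # [^...] is a character class: this matches one or more characters that are
    # not any of c, h, a, p, t, e, r. It does NOT mean "anything except the word
    # chapter". (Putting quotes inside the brackets just adds the quote
    # character itself to the set.)
    per_character_negation = LlamaGrammar.from_string('root ::= [^chapter]+')

    # A positive string literal, by contrast, matches the whole sequence in order.
    exact_sequence = LlamaGrammar.from_string('root ::= "chapter"')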

pacmanincarnate changed the title from "using the negate character (^) with a sequence" to "using the negate character (^) in a grammar with a sequence" on Aug 30, 2023

KerfuffleV2 (Collaborator) commented Sep 1, 2023

[blah] is specifically a character-set match; the same syntax works in regular expressions, for example. It would be really confusing if it (or its negation) worked differently, and changing this would break... basically everything.

Certainly it would be useful to be able to negate strings (I'm not familiar enough with the grammar stuff to know if that even exists already). I suspect it might be a hard thing to implement. You may or may not already know this, but LLMs generate tokens - one at a time - and tokens aren't words or letters, they're basically arbitrary chunks of text. Let's say you want to forbid "I like foxes" from being generated. This tokenizes like:

   306 -> ' I'
   763 -> ' like'
  1701 -> ' fo'
  9100 -> 'xes'
 29991 -> '!'

The LLM only generates one token at a time, and it doesn't know things like which tokens are going to be penalized in the future. You also obviously can't ban all tokens (or sequences) that lead to "I like foxes": banning "I" outright would be ridiculous, and the same goes for "like", etc. You can't even ban the sequence I, like, fo because the LLM could be trying to write something like "I like food".
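
(As an aside, if you want to check how a string tokenizes for your particular model, a sketch like this with the llama-cpp-python bindings works; the model path is a placeholder and the exact ids depend entirely on that model's vocabulary:)

    # Hedged sketch: inspect how a string tokenizes for a local GGUF model.
    # "./model.gguf" is a placeholder path; ids vary by model/tokenizer.
    from llama_cpp import Llama

    llm = Llama(model_path="./model.gguf", vocab_only=True)
    for tok in llm.tokenize(b" I like foxes!", add_bos=False):
        piece = llm.detokenize([tok]).decode("utf-8", errors="replace")
        print(f"{tok:>6} -> {piece!r}")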

So if you had this kind of sequence-forbidding function, what will happen is: the LLM will generate I, like, fo, and then the grammar sampling will set the probability of xes to -infinity so it can't be generated. Now the LLM has to pick something else. Maybe it'll try to write the thing using single-letter tokens, maybe it'll write something nonsensical because there are no other reasonable choices.
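
(Very roughly, that "-infinity" step looks something like this; the names are purely illustrative, not llama.cpp's actual sampler code:)

    import math

    # Toy illustration of constrained sampling: logits of tokens the grammar
    # cannot accept get pushed to -inf before softmax, so they can never be
    # sampled. Purely illustrative, not llama.cpp's implementation.
    def mask_logits(logits, allowed_ids):
        return [x if i in allowed_ids else -math.inf for i, x in enumerate(logits)]

    print(mask_logits([1.2, 0.3, 2.5], allowed_ids={0, 2}))  # [1.2, -inf, 2.5]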

So I think even if you could do this, you probably wouldn't want to. You'd generally get better results using the grammar to steer it toward generating what you want rather than forbidding stuff you don't, because really, you can only forbid the very last token in the sequence that would make it match your negative pattern. LLM says: "I can't generate xes? Okay, let's just say I like foxxes! There you go, buddy, it didn't match the pattern you said I couldn't write! Happy now?" Probably not.
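
(As a sketch of "steer it toward what you want": a positive grammar like the one below constrains the shape of the output instead of trying to forbid a phrase; the model path and prompt here are placeholders:)

    # Hedged sketch: constrain generation to a small positive pattern rather
    # than trying to ban a phrase. Model path and prompt are placeholders.
    from llama_cpp import Llama, LlamaGrammar

    grammar = LlamaGrammar.from_string('root ::= ("Section" | "Part") " " [0-9]+')
    llm = Llama(model_path="./model.gguf")
    out = llm("Write a heading:", grammar=grammar, max_tokens=8)
    print(out["choices"][0]["text"])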

pacmanincarnate (Author) commented

That's a good point and something I had thought of. I was hoping there might be a way, since you can specify a string in the positive. Do you know, by chance, if you can negate a token, rather than a character? I've never seen this implemented in a grammar, but since a lot of words are single tokens, that might be a reasonable way to negate many words.

Thanks for the response.

KerfuffleV2 (Collaborator) commented

No problem.

> Do you know, by chance, if you can negate a token, rather than a character?

As far as I know, that doesn't currently exist. I haven't used the grammar stuff personally but I've never seen anything about that. I don't think it's something that would really be too hard to add but I think it might be too hard to use in a practical way.

You'd have to know how stuff tokenizes before creating the grammar, and it also would probably only work with models in the same family. For example, a Falcon model might tokenize things differently from a LLaMA-2 model, so you'd have to know those details when writing the grammar.


Another approach to dealing with unwanted sequences of tokens is to rewind the history and ban the start of the part you don't want to see. I've been experimenting with that in my sequence repetition sampler project and it seems to work pretty well. This is a potential way the grammar stuff could handle what you were talking about originally, but of course it doesn't currently.

It's a fairly complicated thing to do (I actually don't even know how to make the grammar compatible with my rewinds), and there's also a performance cost to rewinding since you have to regenerate tokens.
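
(For the curious, here's the rewind-and-ban idea in toy form; this is not the actual sequence repetition sampler, just a mock with a stand-in random "sampler" to show the control flow:)

    import random

    # Toy mock of the rewind-and-ban idea described above -- NOT the real
    # sampler. A stand-in "sampler" picks random tokens from a tiny vocab;
    # whenever the banned text appears, we rewind to where it started and
    # forbid the token that began it at that position.
    random.seed(0)
    VOCAB = {1: " I", 2: " like", 3: " fo", 4: "xes", 5: "od", 6: "!"}
    BANNED = " I like foxes"

    def sample_next(banned_ids):
        return random.choice([t for t in VOCAB if t not in banned_ids])

    tokens, bans = [], {}  # bans: position -> set of forbidden token ids
    for _ in range(50):
        if len(tokens) >= 8:
            break
        tokens.append(sample_next(bans.get(len(tokens), set())))
        text = "".join(VOCAB[t] for t in tokens)
        idx = text.find(BANNED)
        if idx != -1:
            # Map the character offset back to the token where the match began,
            # then rewind to that position and ban that token there.
            start, off = 0, 0
            while off < idx:
                off += len(VOCAB[tokens[start]])
                start += 1
            bans.setdefault(start, set()).add(tokens[start])
            tokens = tokens[:start]

    print("".join(VOCAB[t] for t in tokens))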

github-actions bot added the stale label on Mar 25, 2024

github-actions bot commented Apr 9, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.
