Improving the repetition penalty #331

Piezoid · 2023-03-20T15:43:12Z

129c7d1 (#20) added a repetition penalty that prevents the model from running into loops.

Here are a few suggestions for possible enhancements:

- One issue with the interactive mode is that the repetition penalty affects the anti-prompt and response prefix, causing the model to generate unnecessarily long responses. One solution could be to exclude these tokens from the penalty.
- The penalty could be waived or reduced for stop words, punctuation characters, and newlines, perhaps by applying a frequency-based penalty instead.
- An exponential decay could be used, so that recent tokens are penalized more than older ones, causing fewer issues with large repeat_last_n windows.
- Token repetition is only an approximation of sub-string or word repetition, but it seems difficult to do otherwise without backtracking the inference.
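
For reference, a minimal sketch of a window-based repetition penalty with an exemption list, as suggested in the first point. The sign-aware rule (divide positive logits, multiply negative ones) and the `exempt` parameter are assumptions made for illustration; they are not taken from the referenced commit.

```python
def apply_repeat_penalty(logits, recent_tokens, penalty, exempt=frozenset()):
    """Return a copy of `logits` with recently seen tokens penalized.

    `exempt` is a hypothetical set of token ids (anti-prompt, response
    prefix, newline, ...) that are never penalized.
    """
    out = list(logits)
    for tok in set(recent_tokens) - set(exempt):
        # Dividing a positive logit and multiplying a negative one
        # both make the token less likely.
        if out[tok] > 0:
            out[tok] /= penalty
        else:
            out[tok] *= penalty
    return out


# Tokens 0, 1 and 2 were seen recently; token 2 is exempt.
print(apply_repeat_penalty([2.0, -1.0, 3.0, 0.5], [0, 1, 2], 2.0, exempt={2}))
```

Every token in the window is penalized by the same amount here, regardless of its age or how often it occurred, which is the behaviour the suggestions above try to refine.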

j-f1 · 2023-03-20T16:55:14Z

There are numerical weights associated with each token in the tokenizer, it may be useful to use those in this calculation somehow.

Piezoid · 2023-03-21T18:29:19Z

I figured that we can get the repetition length by comparing the recent token history against the histories prior to the past occurrences of the candidate tokens. In other words, find the longest suffix shared between the text up to the current position and the sub-strings to the left of the previous occurrences of the candidate tokens.

For each token and past occurrence, we have an age or distance in the text d, and a repetition length l+1 (left-extension of l tokens, plus the token we are about to add).

I've thrown together a test using an ad-hoc repetition score $\exp(-k \cdot d / (l + 1)) \in\ ]0,1]$. With it, I blend between no penalization and the full repeat_penalty. It seems reasonable, but I have the feeling that I'm reinventing the wheel.
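
A rough sketch of how such a score could be computed on a token list. The function names, the convention that d is 0 when the candidate equals the last token, taking the maximum over past occurrences, and the linear blend are all assumptions made for illustration; the actual test code may differ.

```python
import math


def shared_suffix_len(tokens, i, j):
    """Length of the longest common suffix of tokens[:i] and tokens[:j]."""
    n = 0
    while n < i and n < j and tokens[i - 1 - n] == tokens[j - 1 - n]:
        n += 1
    return n


def repetition_score(tokens, candidate, k=0.05):
    """Highest exp(-k * d / (l + 1)) over all past occurrences of `candidate`."""
    end = len(tokens)
    best = 0.0
    for pos, tok in enumerate(tokens):
        if tok != candidate:
            continue
        d = end - 1 - pos                        # age of this occurrence (assumed convention)
        l = shared_suffix_len(tokens, pos, end)  # left-extension of the repeat
        best = max(best, math.exp(-k * d / (l + 1)))
    return best


def blended_penalty(repeat_penalty, score):
    """Linear blend between no penalty (1.0) and the full repeat_penalty (assumed form)."""
    return 1.0 + (repeat_penalty - 1.0) * score


history = "I love you , I love".split()
for cand in ("you", ",", "cat"):
    s = repetition_score(history, cand)
    print(cand, round(s, 3), round(blended_penalty(1.3, s), 3))
```

In this example "you" would extend a three-token repeat ("I love you") and scores higher than ",", which is more recent but has no matching left context. A token that never occurred scores 0 and is not penalized.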

I have not tested this extensively, but the initial results are interesting. For example here's a song sampled almost greedily (--temp 0.15 with 7B Q4_0): it is not very focused, but there is almost no stuttering.

Me llamo LLaMa, soy una llama!

Yo no me gusta el frío.

Mi casa es un caloroso hogar,
con una chimenea muy grande.
Tengo un cachito de carbón,
que me calientan los pies.

Yo no me gustaría un frío,
como el que hay en la nieve.
Me gusto un calor muy bueno,
un calentito como yo.

Cuando me pongo a cantar,
me gusta un calorcito.
Y cuando voy a dormir,
mi casa se queda muy fría.

Pero cuando me despierto,
algo caliente viene a mi lado.
Es un calentito muy bueno,
que me hace sentir bien.

Esto es lo que yo digo:
¡Caliente, caliente! ¡Yo no me gusto el frío!
Me llaman LLaMa, soy una llama.
Yo no me gusta el frío.

Other classic examples ("I love you, I love") seem OK at low temperatures. I have yet to try generating code or using alpaca with it.
I'm unsure how to reliably evaluate the penalization schemes. I don't really expect a noticeable effect on the perplexity, but I could try with repetitive text samples that incite the model to repeat itself at unexpected places.

My implementation extends the repetitions character by character, rather than token by token. It should be more efficient to work directly on the tokens. However, my concern is that the model may attempt to cheat by using the morphological skills it acquired during training through subword sampling.

> There are numerical weights associated with each token in the tokenizer, it may be useful to use those in this calculation somehow.

@j-f1 Yes, it would be nice to reuse the frequencies from BPE. I don't have access to them in my working branch, as I'm still using #66 with the old model files.

This should help reduce the penalization of punctuation tokens. Currently, it is not that bad since punctuation marks and stop words are often drawn from sharp distributions (low entropy) on which the penalization has little influence.

Piezoid · 2023-03-27T14:05:27Z

I've been trying to predict the token frequency from the tokenizer model's "scores".
Looking at the score distribution, it seems that -score is actually the rank of the token, with a few special cases:

- Some tokens have score == -1e9. These are mostly space tokens that shall not be merged: ▁▁ ▁▁▁▁ ▁▁▁▁▁▁▁▁ ▁▁▁▁▁ ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▁▁▁▁▁▁ ▁▁▁▁▁▁▁▁▁▁▁▁ ▁▁▁▁▁▁▁▁▁▁▁▁▁ ▁▁▁▁▁▁▁▁▁▁ ▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▁▁▁ ▁▁▁▁▁▁▁▁▁ ▁▁▁▁▁▁▁ ▁▁▁▁▁▁▁▁▁▁▁ ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▁
- Some tokens have score == 0, mostly control and non-ASCII bytes: <unk> <s> </s> <0x00> <0x01> <0x02> <0x03> [...] <0xFE> <0xFF>
- Some tokens are missing (max(ranks) > len(ranks))

I've used the token frequency in wiki-text-2. The 50 most frequent tokens are:

▁the ▁, ▁. ▁of ▁and ▁in ▁to ▁a ▁= ▁" ▁@ ▁was ▁\' ▁The ▁as ▁that ▁on ▁for ▁with ▁by ▁) ▁( ▁is ▁from ed ▁at ing ▁his ▁were ▁it ▁he ▁an ▁In ▁had ▁which ▁be ▁are ▁; ▁not ▁their ▁but ▁A es ▁first ▁– ▁also ▁its ▁or ▁: ers

However, the tokenizer's top 50 tokens (according to the tokenizer score) are:

▁t er in ▁a en on ▁th es ▁s ▁d at or an ▁c is re it ▁the ar le ▁w ▁p ou al ▁f ▁m ed ▁o ▁b om ion ing ic as el ent ▁in ▁h nd et ▁l ▁n st ▁to ch ▁I ro il ▁of de

I'm not sure what is going on. The two rank distributions are only slightly correlated, as shown by plotting the wiki-text ranks against the tokenizer's ranks:

The wiki-text frequencies and ranks are a good fit for Zipf's law, using a maximum likelihood method:

As expected, predicting the frequencies from the tokenizer score doesn't work as well:

Notebook for the analysis
Resulting equation:

$$ p(x) \simeq \frac{(-\text{score}(x))^{-0.837}}{27.7} $$

The accuracy might be enough for limiting the penalization applied to the most frequent tokens. But there is likely something wrong. Would really appreciate it if someone could let me know if I overlooked something.
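
As an illustration, the fitted equation above can be wrapped in a small helper. The function name and the choice to return None for the special-case scores (0 and -1e9) are assumptions; the constants are the ones from the fit.

```python
def estimated_frequency(score, exponent=0.837, norm=27.7):
    """Rough unigram probability of a token from its tokenizer score.

    Returns None for scores the rank interpretation does not cover:
    score == 0 (control and byte tokens) and score == -1e9 (space runs).
    """
    if score >= 0 or score <= -1e9:
        return None
    rank = -score  # -score is read as the token's rank
    return rank ** -exponent / norm


for s in (-1.0, -10.0, -1000.0, 0.0, -1e9):
    print(s, estimated_frequency(s))
```

Such an estimate could be used to scale down the penalty for tokens predicted to be very frequent, with the caveat stated above that the fit is loose.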

Piezoid · 2023-03-27T14:24:13Z

I have a prototype on this branch for the decaying repetition penalty weighted by repeat length.
By default, this should generate the same results as the master branch. The exponential decay is enabled by replacing the --repeat_last_n option with the new --repeat_half_life option. They are mutually exclusive.

For example, --repeat_half_life 16 implies that:

- Repeating the last token will cost the full --repeat_penalty.
- A 16-token-old, 1-token-long repetition will be half-penalized.
- A 32-token-old, 1-token-long repetition will receive a quarter of the penalization.
- A 32-token-old, 2-token-long repetition will be half-penalized; etc.
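
These four cases are consistent with a weight of the form 0.5^(age / (half_life × length)), which matches the earlier score exp(-k * d / (l + 1)) when k = ln 2 / half_life. A short sketch under that assumption (the function name is hypothetical and the branch may compute it differently):

```python
def decay_weight(age, length, half_life):
    """Fraction of the full penalty applied to a repeat.

    age:       distance in tokens to the previous occurrence (0 = last token)
    length:    number of tokens in the repeated sequence
    half_life: value that would be passed as --repeat_half_life
    """
    return 0.5 ** (age / (half_life * length))


for age, length in [(0, 1), (16, 1), (32, 1), (32, 2)]:
    print(age, length, decay_weight(age, length, half_life=16))
# prints weights 1.0, 0.5, 0.25, 0.5
```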

I've found that this new penalization heuristic helps when sampling at low temperatures. I recommend increasing the --repeat_penalty a bit (1.2-1.4). Also, because it doesn't account for token frequencies, an increased penalty may cause issues with punctuation, stop words, and newlines, especially when generating code.

anzz1 · 2023-03-27T14:58:44Z

@Piezoid

Thanks for pointing me towards here from the other discussion. I'll be checking your branch out and testing it. I'm also rooting for you finishing the trace tool at some point, I see that it could be highly valuable. Especially for larger scale testing and graphing, but even for smaller cases like debugging and general interest it's a cool feature to have to be able to see how and why the decisions were made and which tokens are what, etc.

I had the idea some time ago to have a command line option to output that stuff to console, maybe redirect stderr but tbh your idea is a lot better. Using the binary format saves space if someone wants to do something crazy like a 7day run, and it can be easily turned into a log file, graph or whatever by another tool. Great idea altogether 💡

Disclaimer though to not expect too much 😄

My CPU is too low powered to do any proper quantitative analysis, like using the perplexity tool or making 100s of runs with the tracer (Trace model outputs to a binary file #477) to have some output. I just haven't had the need to upgrade, but I'm thinking of upgrading soon given the newfound interest in this, as there are a lot of things I would like to do but simply cannot rn due to lacking hardware (i5-6600k 4c/4t lul 🚀).
To be perfectly honest I dont really know wtf i'm doing half the time and while I do pretty much understand the concepts at play here regarding token selection and probabilities I cannot really visualize which parameter affects what, like the logic chain of it when you tweak this value it causes this and that to change in this and that way and this is why.
I'm kinda just shooting blind and testing various models and tweaking stuff randomly and seeing what comes out. So all my research so far is very anecdotal and subjective and maybe not super valuable, but that is not saying that it doesn't have any value. Some things like creativity, behaviour or quality of a story are things which are pretty hard to quantitatively assess anyway after all.

LostRuins · 2023-03-28T09:23:48Z

Just throwing 2c at past implementations:

- OpenAI uses 2 variables for this: a presence penalty and a frequency penalty. The current implementation of rep pen in llama.cpp is equivalent to a presence penalty; adding an additional penalty based on the frequency of tokens in the penalty window might be worth exploring too.
- KoboldAI instead uses a group of 3 values, which we call "Repetition Penalty", "Repetition Penalty Slope" and "Repetition Penalty Range". The penalty is applied as a sigmoid interpolation between the Repetition Penalty value (at the most recent token) and 1.0 (at the end of the Repetition Penalty Range). The defaults we use are 1.1 rep pen, 1024 range and 0.7 slope, which our community agrees give relatively decent results across most models.
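
To make the first point concrete, here is a minimal sketch of additive presence and frequency penalties in the style OpenAI documents: each token's logit is reduced by a flat amount if it appears in the window at all, plus an amount proportional to its count. The function and parameter names are made up for illustration. The KoboldAI slope is not sketched, since its exact curve is not given here.

```python
from collections import Counter


def apply_presence_frequency(logits, window, presence=0.0, frequency=0.0):
    """Return a copy of `logits` with additive penalties for tokens in `window`."""
    out = list(logits)
    for tok, count in Counter(window).items():
        # Flat presence term once, frequency term once per occurrence.
        out[tok] -= presence + frequency * count
    return out


# Token 0 occurred twice, token 2 once, token 1 never.
print(apply_presence_frequency([1.0, 1.0, 1.0], [0, 0, 2], presence=0.5, frequency=0.25))
# prints [0.0, 1.0, 0.25]
```

Unlike the multiplicative repeat penalty, these terms are subtracted from the logits, so the effect does not depend on the sign or magnitude of the logit.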

Piezoid · 2023-03-28T11:03:50Z

@LostRuins Thank you for the references.

Do you know how naturally recurring tokens (such as \n, ▁, ▁the, ▁,, ▁., etc.) are handled by these samplers? It is unclear whether we should penalize punctuation and stop-word tokens in the same way as more specific tokens. The model is quite confident when predicting these tokens and doesn't seem to be too deterred by the penalty. However, I've noticed that high penalties force the model into writing longer sentences and using fewer newlines.

The decaying penalty in KoboldAI appears to be similar to the one in my prototype. The main difference is that I'm using an exponential decay instead of a sigmoid one. Also, the repeat distance/age is divided by the length of the repeated sequence of tokens.

LostRuins · 2023-03-29T16:59:47Z

I think there is no special handling - all tokens are treated equally except for intentionally banned tokens.

ralyodio · 2023-05-15T03:56:29Z

what are valid values for repeat penalty?

LostRuins · 2023-05-15T06:53:54Z

Anything between 1 and 2 is a good start.

Anyeos · 2023-06-29T09:22:09Z

I tested a lot of values for each parameter in practice. I noticed that it is far from ready for continuous use in interactive mode, because sooner or later it gets stuck in a loop (i.e. always the same answer regardless of user input).
Increasing temp to near 2 or more starts outputting near-random words (I think I reached a limit).
Increasing the penalty to 2 or more is not usable either, because the model tries to say something as different as possible and ends up saying anything but what the prompt asks for.
I was testing the mirostat mode and it ends up being useless because it makes the chat worse over time. mirostat 2 is worse than 1. I think 2 is still in early development.

So, some values that work at the beginning but get worse over time are:

--ctx_size 1024
--batch-size 2048
--temp 0.82
--top-k 30
--top-p 1.8
--tfs 2.0
--typical 1.0
--keep -1
--repeat-last-n 1024
--repeat-penalty 1.2
--no-penalize-nl
--mirostat 1
--mirostat-lr 0.5
--mirostat-ent 4.0

I just want to share my values so we can make comparisons and improve them.
I expect an interactive chat that is sustained over time as if it had just been started, while only "remembering" some past messages.
The actual result is a chat that becomes more and more monotonous and finally ends in some loop or repetitive answer.
I tested a lot of values for every parameter without success, always with the same result: repetitive at the end. And that happens after only 10 or 20 messages, so the end is not very far away either.

Thank you everyone for this great project. Not perfect but now I have a high quality LLM AI thanks to you all, so I am thankful despite the failures.

Piezoid · 2023-09-14T13:23:49Z

Closing this as out of date. Several alternatives have helped this issue since then: #1126, #2135, #2593.

gjmulder added the enhancement and generation quality labels Mar 20, 2023

Piezoid mentioned this issue Mar 24, 2023

Trace model outputs to a binary file #477

Closed

Piezoid mentioned this issue Apr 10, 2023

Changing default repeat_last_n value to current context size? #787

Closed

Piezoid mentioned this issue Apr 22, 2023

Sample interface, new samplers, #1126

Merged

nlpander mentioned this issue May 2, 2023

Repetition Penalty nlpander/guanaco#6

Open

WolframRavenwolf added a commit to WolframRavenwolf/simple-proxy-for-tavern that referenced this issue Jul 17, 2023

Update deterministic.json

81878dd

Piezoid mentioned this issue Sep 14, 2023

Implementation of a sequence repetition penalty sampler #2593

Draft

Piezoid closed this as completed Sep 14, 2023
