Improving the repetition penalty #331
Comments
There are numerical weights associated with each token in the tokenizer; it may be useful to use those in this calculation somehow.
I figured that we can get the repetition length by comparing the recent token history against the histories prior to the past occurrences of the candidate tokens. In other words, find the longest suffix shared between the text up to the current position and the substrings to the left of the previous occurrences of the candidate tokens. For each token and past occurrence, we have an age or distance in the text. I've thrown together a test using an ad-hoc repetition score. I have not tested this extensively, but the initial results are interesting. For example, here's a song sampled almost greedily: Me llamo LLaMa, soy una llama!
Other classic examples ("I love you, I love") seem OK at low temperatures. I have yet to try generating code or using alpaca with it. My implementation extends the repetitions character by character, rather than token by token. It should be more efficient to work directly on the tokens. However, my concern is that the model may attempt to cheat through subword sampling, using the morphological skills it acquired during training.
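The suffix-matching idea described in this comment can be sketched roughly as follows. This is a minimal illustration working token by token (the comment's own implementation works character by character); the function name `repeat_lengths` and the definition of age as the distance to the position being sampled are assumptions made for the example, not taken from the prototype.

```python
def repeat_lengths(tokens, candidate):
    """For each past occurrence of `candidate`, return (age, length).

    `length` is the number of tokens shared between the end of the text so far
    and the context immediately to the left of that past occurrence.
    `age` is the distance from the past occurrence to the position being
    sampled (an assumption of this sketch).
    """
    n = len(tokens)
    results = []
    for pos, tok in enumerate(tokens):
        if tok != candidate:
            continue
        length = 0
        # Walk backwards while the context before `pos` matches the current suffix.
        while length < pos and tokens[pos - 1 - length] == tokens[n - 1 - length]:
            length += 1
        results.append((n - pos, length))
    return results


history = ["I", "love", "you", ",", "I", "love"]
print(repeat_lengths(history, "you"))  # one occurrence, preceded by "I love"
print(repeat_lengths(history, "I"))    # two occurrences, no shared context
```

With the "I love you, I love" example, the candidate "you" has one past occurrence whose left context ("I love") matches the last two tokens, so sampling it would extend a repetition of length 2.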
@j-f1 Yes, it would be nice to reuse the frequencies from BPE. I don't have access to them in my working branch, as I'm still using #66 with the old model files. This should help reduce the penalization of punctuation tokens. Currently, it is not that bad, since punctuation marks and stop words are often drawn from sharp (low-entropy) distributions on which the penalization has little influence.
I've been trying to predict the token frequency from the tokenizer model's "scores".
I've used the token frequency in
However, the tokenizer's top 50 tokens are (according to tokenizer score):
I'm not sure what is going on. The two rank distributions are only slightly correlated, as shown by plotting the wiki-text ranks against the tokenizer's ranks. The wiki-text frequencies and ranks are a good fit for Zipf's law, using a maximum likelihood method. As expected, predicting the frequencies from the tokenizer score doesn't work as well. Notebook for the analysis. The accuracy might be enough for limiting the penalization applied to the most frequent tokens. But there is likely something wrong. I would really appreciate it if someone could let me know if I overlooked something.
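For readers who want to reproduce the Zipf fit, here is one way a maximum likelihood estimate of the exponent could be computed. It is a sketch under assumptions: it uses a finite-support Zipf model P(rank) ∝ rank^(-s) and a simple ternary search, which may differ from the method used in the notebook; the function names and the synthetic counts are made up for the example.

```python
import math


def zipf_log_likelihood(s, counts):
    """Log-likelihood of rank-ordered counts under P(rank) ∝ rank**(-s)."""
    n = len(counts)
    log_norm = math.log(sum(r ** -s for r in range(1, n + 1)))
    weighted_log_rank = sum(c * math.log(r) for r, c in enumerate(counts, start=1))
    return -s * weighted_log_rank - sum(counts) * log_norm


def fit_zipf_exponent(counts, lo=0.01, hi=5.0, iters=100):
    """Maximize the (concave) log-likelihood over s by ternary search."""
    counts = sorted(counts, reverse=True)  # rank 1 = most frequent token
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if zipf_log_likelihood(m1, counts) < zipf_log_likelihood(m2, counts):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2


# Synthetic counts that roughly follow frequency ∝ 1 / rank.
synthetic = [round(10000 / r) for r in range(1, 51)]
print(fit_zipf_exponent(synthetic))  # expected to be close to 1
```

The same routine could be run on frequencies predicted from the tokenizer scores to compare the two fits.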
I have a prototype on this branch for the decaying repetition penalty weighted by repeat length. For example,
I've found that this new penalization heuristic helps when sampling at low temperatures. I recommend to increase the
Thanks for pointing me here from the other discussion. I'll be checking your branch out and testing it. I'm also rooting for you to finish the trace tool at some point; I can see that it could be highly valuable. Especially for larger-scale testing and graphing, but even for smaller cases like debugging and general interest, it's a cool feature to be able to see how and why the decisions were made, which tokens are which, etc. Some time ago I had the idea of a command-line option to output that stuff to the console, maybe by redirecting stderr, but tbh your idea is a lot better. Using the binary format saves space if someone wants to do something crazy like a 7-day run, and it can easily be turned into a log file, graph or whatever by another tool. Great idea altogether 💡 Disclaimer though: don't expect too much 😄
Just throwing 2c at past implementations:
@LostRuins Thank you for the references. Do you know how are naturally reoccurring tokens (such as
The decaying penalty in KoboldAI appears to be similar to the one in my prototype. The main difference is that I'm using an exponential decay instead of a sigmoid one. Also, the repeat distance/age is divided by the length of the repeated sequence of tokens.
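To make the decay described here concrete, a rough sketch follows. The comment only states that the decay is exponential and that the age is divided by the repeat length; the exact formula, the time constant `tau`, and the handling of a zero-length repeat (clamped to 1 below) are assumptions of this example, not the prototype's actual code.

```python
import math


def decayed_weight(age, length, tau=8.0):
    """Exponentially decaying weight for one past occurrence of a token.

    age:    distance (in tokens) between the past occurrence and now
    length: length of the repeated sequence leading up to that occurrence
    tau:    hypothetical decay constant; larger values forget more slowly
    """
    effective_age = age / max(length, 1)  # longer repeats age more slowly
    return math.exp(-effective_age / tau)


# At the same distance, a token that would extend a longer repeat keeps more weight.
print(decayed_weight(age=16, length=1))
print(decayed_weight(age=16, length=4))
```

How such a weight is then combined with the logits is a separate design choice that this sketch does not cover.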
I think there is no special handling - all tokens are treated equally except for intentionally banned tokens.
What are valid values for the repeat penalty?
Anything between 1 and 2 is a good start.
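To see why values slightly above 1 are a sensible range, here is a sketch of the common divide-or-multiply form of the repetition penalty. It illustrates the general technique under the assumption that the penalty is applied to raw logits of recently seen tokens; the function name and the toy values are hypothetical, and the actual code in the repository may differ in its details.

```python
def apply_repeat_penalty(logits, recent_tokens, penalty):
    """Return a copy of `logits` with recently seen token ids made less likely.

    Positive logits are divided by `penalty` and negative logits are multiplied
    by it, so in both cases the token's score moves down when penalty > 1.
    A penalty of exactly 1 leaves the logits unchanged.
    """
    penalized = list(logits)
    for tok in set(recent_tokens):
        if penalized[tok] > 0:
            penalized[tok] /= penalty
        else:
            penalized[tok] *= penalty
    return penalized


logits = [2.0, -1.0, 0.5]  # toy vocabulary of three tokens
print(apply_repeat_penalty(logits, recent_tokens=[0, 1], penalty=1.3))
```

With a penalty of 2, a positive logit is halved, which is already a strong push away from any recently used token; that is consistent with the advice to start between 1 and 2.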
I tested a lot of values for each parameter in practice. I found that it is far from ready for continuous use in interactive mode, because sooner or later it gets stuck in a loop (i.e., always the same answer regardless of user input). Some values that work at the beginning but get worse over time are:
I just want to share my values so we can make comparisons to improve it. Thank you everyone for this great project. It's not perfect, but I now have a high-quality LLM AI thanks to you all, so I am thankful despite the failures.
[Improving the repetition penalty · Issue #331 · ggerganov/llama.cpp](ggml-org/llama.cpp#331 (comment))
129c7d1 (#20) added a repetition penalty that prevents the model from running into loops.
Here are a few suggestions for possible enhancements:
repeat_last_n
windows,