Errors w/ BPE tokenizers (GGML_ASSERT: llama.cpp:2029: codepoints_from_utf8(word).size() > 0 and more) #4360
Comments
The BPE tokenizer was taken from a project of mine; it was accompanied by a slim unicode library (cmpnct_unicode.cpp). I ran into a similar issue with Chinese tokens when working with OpenBuddy, which contains a large Chinese BPE vocab as special tokens. I did not have time to properly debug or fix it, but I use a quickfix on my end by just switching back to my lib until the bug is fixed.
This is all quite a bit much to do; I'm just showing what I did on my local fork of llama.cpp. It's not meant as a fix to the bug itself, just as a workaround and an indication of where it is.
Maybe of interest: for our extended tokenizer (and maybe other extended SentencePiece tokenizers like ELYZA's), a Japanese dev @mmnga was able to GGUF-quant our model by using a slightly modified convert.py script that just adds the additional vocab in. (I thought it would be really hard, but the diff looks not so bad?)
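For illustration, here is a minimal sketch of the general idea, assuming a standard HF-style `tokenizer.json` layout; this is not @mmnga's actual diff, and the file name and function name are placeholders:

```python
# Hedged sketch: merge added_tokens into the base BPE vocab during
# conversion. Not the actual convert.py patch, just the general idea.
import json

def load_extended_vocab(tokenizer_json_path):
    with open(tokenizer_json_path, encoding="utf-8") as f:
        tok = json.load(f)
    # The base BPE vocab lives under model.vocab in an HF tokenizer.json.
    vocab = dict(tok["model"]["vocab"])
    # added_tokens entries carry their own ids; merge them in on top.
    for entry in tok.get("added_tokens", []):
        vocab[entry["content"]] = entry["id"]
    return vocab

vocab = load_extended_vocab("tokenizer.json")
print(f"total entries (base + added): {len(vocab)}")
```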
I use a similar hack on models in my convert.py. If I recall right, llama.cpp actually HAD that support half a year ago or so, and for some reason it was removed. I guess the 3rd option is the best; that's likely why the support was dropped (too early).
@cmp-nct I'm adding it back in. Give me about 2 - 3 days.
This is still a problem for the new foundational models released by InternLM (which have been Llama-fied by Charles Goddard).
In the case of the InternLM2 model, the problem is with token 354. I created a simple script that edits the sentencepiece model.
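For reference, a sketch of what such a script could look like, using the protobuf definitions that ship with the `sentencepiece` pip package (requires `protobuf` installed). The index 354 comes from the comment above; the file names and the replacement string are placeholders, not the actual fix:

```python
# Hedged sketch: inspect and patch a single piece in a sentencepiece model.
from sentencepiece import sentencepiece_model_pb2 as sp_model

m = sp_model.ModelProto()
with open("tokenizer.model", "rb") as f:
    m.ParseFromString(f.read())

# Inspect the problematic piece reported above.
print(repr(m.pieces[354].piece))

# Placeholder patch: swap in U+FFFD so the entry decodes as valid UTF-8.
m.pieces[354].piece = "\N{REPLACEMENT CHARACTER}"

with open("tokenizer.fixed.model", "wb") as f:
    f.write(m.SerializeToString())
```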
This issue is stale because it has been open for 30 days with no activity. |
This issue was closed because it has been inactive for 14 days since being marked as stale. |
Prerequisites
Please answer the following questions for yourself before submitting an issue.
Expected Behavior
I have a Mistral 7B based model, shisa-7b-v1, that has an extended (128128) BPE tokenizer. This works fine, and I have pulled the `vocab.json` from `tokenizer.json` (there is also a `special_tokens_map.json`, and some `added_tokens` in the `tokenizer.json`). I am able to convert the model with `--vocabtype bpe` with no errors. And I am actually able to run `llama_bench` on the model; however, when inferencing, I get this error:
Current Behavior
As mentioned, there is an assert here that gets triggered: https://github.com/ggerganov/llama.cpp/blob/bcc0eb4591bec5ec02fad3f2bdcb1b265052ea56/llama.cpp#L2695
I did a bit of poking and ended up hacking in a replacement token just to see if I could make it go:
I tried to get a codepoint, and it turns out the assert only triggers once, but the offending token sadly seems to be a literal null character?
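To check from the Python side which vocab entries would trip that assert, one can scan the extracted `vocab.json` (the file name comes from this report) for entries that are empty or contain a NUL byte; this is a rough approximation of what `codepoints_from_utf8` rejects, not the C++ logic itself:

```python
# Rough diagnostic sketch: flag vocab entries that look like they would
# fail the codepoints_from_utf8(word).size() > 0 assert.
import json

with open("vocab.json", encoding="utf-8") as f:
    vocab = json.load(f)  # maps token string -> id

for token, idx in vocab.items():
    # Empty strings and embedded NULs are the kinds of entries
    # the assert would reject.
    if not token or "\x00" in token:
        print(f"suspect token id {idx}: {token!r}")
```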
Sadly, this was not the only error; once things got running, this gets output as well:
That's an awful lot of special tokens? (There are only 4 in our special_tokens_map.json...)
I modified the code to print out what tokens it thought were issues:
It prints out lots of regular tokens; I'm not sure why it's expecting special tokens?
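A quick way to sanity-check the special-token situation from the source files, before conversion even runs: compare what `special_tokens_map.json` declares against the `added_tokens` entries in `tokenizer.json`. A sketch, assuming the standard HF file layout:

```python
# Sketch: cross-check declared special tokens against added_tokens.
import json

with open("special_tokens_map.json", encoding="utf-8") as f:
    declared = set()
    for v in json.load(f).values():
        items = v if isinstance(v, list) else [v]  # lists for additional_special_tokens
        for it in items:
            declared.add(it["content"] if isinstance(it, dict) else it)

with open("tokenizer.json", encoding="utf-8") as f:
    added = {
        e["content"]: e.get("special", False)
        for e in json.load(f).get("added_tokens", [])
    }

print("declared in special_tokens_map.json:", declared)
print("added_tokens flagged special:", {t for t, s in added.items() if s})
```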
Once everything is loaded, we get:
But I didn't follow up further, since it seems that something is messed up either in the conversion or in the token-handling code, and that's beyond my ken.
Note: I found two related open discussions/issues (though I don't think they got past the initial assert); I believe they both involve models that use extended BPE tokenizers:
Those issues have been unresolved for a few months, but I'm reporting this new issue since it hopefully sheds some more light on what might be going on. Maybe the BPE conversion is actually broken?
In our base model card, we actually have a list of models using other tokenizers so that might also help in tracking down issues. StableLM Beta JAVocab and CALM2-7B are two more Llama2 models using non-standard tokenizers.
Environment and Context
I can relay more info if this isn't reproducible, but I don't think that's the issue.