
Add more tokenizer tests #3742


Merged: 7 commits merged into ggml-org:master from add-tokenizer-tests on Oct 24, 2023

Conversation

@Galunid (Collaborator) commented Oct 23, 2023

Conversion scripts used: 96981f3

Issues:

  • Persimmon script doesn't allow --vocab-only

  • GPT-NeoX tokenizer fails with the std::unordered_map illegal access seen in other gpt2-tokenizer-based models. I applied the fix from Missing tokenizer tests #3730 (comment) and the test passed. I didn't include the passing version here, only the failing one; let me know which one you want, @goerch (see the sketch after this list)

  • Refact fails with byte not found in vocab (you can see it in CI)

  • Starcoder fails with byte not found in vocab (you can see it in CI)
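
For reference, the illegal access mentioned in the GPT-NeoX item is what an unguarded std::unordered_map lookup produces. A minimal sketch of the kind of guard involved (hypothetical names, not the actual patch from #3730):

    // Hypothetical sketch: guard a BPE vocab lookup instead of calling
    // token_to_id.at(piece), which throws when the piece is missing
    // (and operator[] would silently insert a new entry). The names
    // token_to_id and unk_id are illustrative, not llama.cpp's actual API.
    #include <cstdint>
    #include <string>
    #include <unordered_map>

    using llama_token = int32_t;

    static llama_token lookup_token(
            const std::unordered_map<std::string, llama_token> & token_to_id,
            const std::string & piece, llama_token unk_id) {
        const auto it = token_to_id.find(piece);
        return it == token_to_id.end() ? unk_id : it->second;
    }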

Models used:

closes #3730

@goerch (Collaborator) commented Oct 23, 2023

I'm testing refact and test-tokenizer-1-bpe is failing

    // TODO: why doesn't this work for the full range of Unicodes?
    // for (uint32_t cp = 0x10000; cp < 0x0010ffff; ++cp) {
    for (uint32_t cp = 0x10000; cp < 0x00080000; ++cp) {
[...]

for codepoint 0x40000. That doesn't look like a random problem. I'm not sure if we should only check the assigned Unicode planes here (my Unicode-fu is weak) or if this is a real problem.
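
For what it's worth, 0x40000 is the first codepoint of plane 4, and planes 4-13 currently have no assigned characters; assigned characters live in planes 0-3 and 14-16. A plane-based filter could look something like this (illustrative sketch, not an existing llama.cpp helper):

    // Illustrative sketch only (not an existing llama.cpp helper): true if
    // cp lies in a Unicode plane that has assigned characters, i.e. planes
    // 0-3 and 14-16; planes 4-13 are currently unassigned.
    #include <cstdint>

    static bool cp_in_assigned_plane(uint32_t cp) {
        const uint32_t plane = cp >> 16; // each plane spans 0x10000 codepoints
        return plane <= 3 || (plane >= 14 && plane <= 16);
    }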

@goerch (Collaborator) commented Oct 23, 2023

I'm not sure if we should only check the assigned Unicode planes here (my Unicode-fu is weak) or if this is a real problem.

Restricting test-tokenizer-1-bpe to the assigned Unicode planes like so

    // NOTE: only testing assigned Unicode planes
    // for (uint32_t cp = 0x10000; cp < 0x0010ffff; ++cp) {
    for (uint32_t cp = 0x10000; cp < 0x00040000; ++cp) {
        std::string str = codepoint_to_utf8(cp);
        std::vector<llama_token> tokens = llama_tokenize(ctx, str, false);
        std::string check = llama_detokenize_bpe(ctx, tokens);
        if (str != check) {
            fprintf(stderr, "%s : error: codepoint %x detokenizes to '%s'(%zu) instead of '%s'(%zu)\n",
                __func__, cp, check.c_str(), check.length(), str.c_str(), str.length());
            return 4;
        }
    }
    for (uint32_t cp = 0x000e0000; cp < 0x0010ffff; ++cp) {
        std::string str = codepoint_to_utf8(cp);
        std::vector<llama_token> tokens = llama_tokenize(ctx, str, false);
        std::string check = llama_detokenize_bpe(ctx, tokens);
        if (str != check) {
            fprintf(stderr, "%s : error: codepoint %x detokenizes to '%s'(%zu) instead of '%s'(%zu)\n",
                __func__, cp, check.c_str(), check.length(), str.c_str(), str.length());
            return 4;
        }
    }

fixes refact and works for all my models (I didn't check starcoder).
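
The round trip above hinges on codepoint_to_utf8; presumably that is a standard UTF-8 encoder, roughly along these lines (a sketch; the helper the test actually uses may differ in details):

    // Sketch of standard UTF-8 encoding of a single codepoint; the real
    // codepoint_to_utf8 used by test-tokenizer-1-bpe may differ in details.
    // (Ignores surrogate/invalid ranges; the loops above start past 0xFFFF.)
    #include <cstdint>
    #include <string>

    static std::string codepoint_to_utf8_sketch(uint32_t cp) {
        std::string s;
        if (cp <= 0x7f) {            // 1 byte:  0xxxxxxx
            s.push_back(static_cast<char>(cp));
        } else if (cp <= 0x7ff) {    // 2 bytes: 110xxxxx 10xxxxxx
            s.push_back(static_cast<char>(0xc0 | (cp >> 6)));
            s.push_back(static_cast<char>(0x80 | (cp & 0x3f)));
        } else if (cp <= 0xffff) {   // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            s.push_back(static_cast<char>(0xe0 | (cp >> 12)));
            s.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3f)));
            s.push_back(static_cast<char>(0x80 | (cp & 0x3f)));
        } else {                     // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            s.push_back(static_cast<char>(0xf0 | (cp >> 18)));
            s.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3f)));
            s.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3f)));
            s.push_back(static_cast<char>(0x80 | (cp & 0x3f)));
        }
        return s;
    }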

@Galunid (Collaborator, Author) commented Oct 23, 2023

After applying your change to the test, starcoder passes too:

$ make test
Running tests...
Test project /home/kris/llama.cpp/build
      Start  1: test-quantize-fns
 1/18 Test  #1: test-quantize-fns ................   Passed    0.02 sec
      Start  2: test-quantize-perf
 2/18 Test  #2: test-quantize-perf ...............   Passed    0.07 sec
      Start  3: test-sampling
 3/18 Test  #3: test-sampling ....................   Passed    0.01 sec
      Start  4: test-tokenizer-0-llama
 4/18 Test  #4: test-tokenizer-0-llama ...........   Passed    0.07 sec
      Start  5: test-tokenizer-0-falcon
 5/18 Test  #5: test-tokenizer-0-falcon ..........   Passed    0.29 sec
      Start  6: test-tokenizer-1-llama
 6/18 Test  #6: test-tokenizer-1-llama ...........   Passed    2.88 sec
      Start  7: test-tokenizer-1-baichuan
 7/18 Test  #7: test-tokenizer-1-baichuan ........   Passed    3.09 sec
      Start  8: test-tokenizer-1-falcon
 8/18 Test  #8: test-tokenizer-1-falcon ..........   Passed    1.88 sec
      Start  9: test-tokenizer-1-aquila
 9/18 Test  #9: test-tokenizer-1-aquila ..........   Passed    1.95 sec
      Start 10: test-tokenizer-1-mpt
10/18 Test #10: test-tokenizer-1-mpt .............   Passed    1.33 sec
      Start 11: test-tokenizer-1-bloom
11/18 Test #11: test-tokenizer-1-bloom ...........   Passed    4.56 sec
      Start 12: test-tokenizer-1-gpt-neox
12/18 Test #12: test-tokenizer-1-gpt-neox ........Subprocess aborted***Exception:   0.91 sec
      Start 13: test-tokenizer-1-refact
13/18 Test #13: test-tokenizer-1-refact ..........   Passed    1.24 sec
      Start 14: test-tokenizer-1-starcoder
14/18 Test #14: test-tokenizer-1-starcoder .......   Passed    1.26 sec
      Start 15: test-grammar-parser
15/18 Test #15: test-grammar-parser ..............   Passed    0.00 sec
      Start 16: test-llama-grammar
16/18 Test #16: test-llama-grammar ...............   Passed    0.00 sec
      Start 17: test-grad0
17/18 Test #17: test-grad0 .......................   Passed    4.41 sec
      Start 18: test-rope
18/18 Test #18: test-rope ........................   Passed    0.04 sec

94% tests passed, 1 tests failed out of 18

Total Test time (real) =  24.01 sec

@goerch (Collaborator) commented Oct 23, 2023

Start 12: test-tokenizer-1-gpt-neox
12/18 Test #12: test-tokenizer-1-gpt-neox ........Subprocess aborted***Exception: 0.91 sec

Interesting. I see

      Start 12: test-tokenizer-1-gpt-neox
12/17 Test #12: test-tokenizer-1-gpt-neox ........   Passed    1.87 sec

Hm.

@Galunid (Collaborator, Author) commented Oct 23, 2023

Pay no attention to neox; it's not relevant here (it uses an old model that was failing on the map illegal access).

@Galunid (Collaborator, Author) commented Oct 23, 2023

Should I also include your patch restricting the tokenizer tests to assigned Unicode planes in this PR?

@goerch (Collaborator) commented Oct 23, 2023

Should I also include your patch restricting the tokenizer tests to assigned Unicode planes in this PR?

Yes, please.

@goerch (Collaborator) left a review comment

Nicely done!

@goerch (Collaborator) commented Oct 23, 2023

@Galunid: sorry for mixing things up when using the GitHub online editor (I'll never try that again ;)). May I ask you to repair the mess? (I don't know enough about git to fix it on your branch.) Thanks again.

@Galunid force-pushed the add-tokenizer-tests branch from efd3c22 to 1244b00 on October 23, 2023 at 17:57
@Galunid (Collaborator, Author) commented Oct 23, 2023

@goerch Done, no worries ;)

@goerch (Collaborator) commented Oct 23, 2023

OK, I'll wait until tomorrow morning (+10 hours) before merging.

@goerch merged commit daab3d7 into ggml-org:master on Oct 24, 2023
@Galunid mentioned this pull request on Oct 24, 2023
mattgauf added a commit to mattgauf/llama.cpp that referenced this pull request on Oct 27, 2023:
* master: (350 commits)
  speculative : ensure draft and target model vocab matches (ggml-org#3812)
  llama : correctly report GGUFv3 format (ggml-org#3818)
  simple : fix batch handling (ggml-org#3803)
  cuda : improve text-generation and batched decoding performance (ggml-org#3776)
  server : do not release slot on image input (ggml-org#3798)
  batched-bench : print params at start
  log : disable pid in log filenames
  server : add parameter -tb N, --threads-batch N (ggml-org#3584) (ggml-org#3768)
  server : do not block system prompt update (ggml-org#3767)
  sync : ggml (conv ops + cuda MSVC fixes) (ggml-org#3765)
  cmake : add missed dependencies (ggml-org#3763)
  cuda : add batched cuBLAS GEMM for faster attention (ggml-org#3749)
  Add more tokenizer tests (ggml-org#3742)
  metal : handle ggml_scale for n%4 != 0 (close ggml-org#3754)
  Revert "make : add optional CUDA_NATIVE_ARCH (ggml-org#2482)"
  issues : separate bug and enhancement template + no default title (ggml-org#3748)
  Update special token handling in conversion scripts for gpt2 derived tokenizers (ggml-org#3746)
  llama : remove token functions with `context` args in favor of `model` (ggml-org#3720)
  Fix baichuan convert script not detecing model (ggml-org#3739)
  make : add optional CUDA_NATIVE_ARCH (ggml-org#2482)
  ...
Successfully merging this pull request may close these issues: Missing tokenizer tests (#3730)

3 participants