
Add more tokenizer tests #3742


Merged: 7 commits merged into ggml-org:master from add-tokenizer-tests on Oct 24, 2023

Conversation

@Galunid (Collaborator) commented Oct 23, 2023

Conversion scripts used: 96981f3

Issues:

  • Persimmon script doesn't allow --vocab-only

  • GPT-NeoX tokenizer fails with the std::unordered_map illegal access seen in other gpt2-tokenizer-based models. I applied the fix from Missing tokenizer tests #3730 (comment) and the test passed. I didn't include the passing version here, only the failing one; let me know which one you want, @goerch (see the sketch after this list)

  • Refact fails with byte not found in vocab (you can see it in CI)

  • Starcoder fails with byte not found in vocab (you can see it in CI)
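
For reference, the illegal access mentioned in the GPT-NeoX item is what an unguarded std::unordered_map lookup produces. A minimal sketch of the kind of guard involved (hypothetical names, not the actual patch from #3730):

    // Hypothetical sketch: guard a BPE vocab lookup instead of calling
    // token_to_id.at(piece), which throws when the piece is missing
    // (and operator[] would silently insert a new entry). The names
    // token_to_id and unk_id are illustrative, not llama.cpp's actual API.
    #include <cstdint>
    #include <string>
    #include <unordered_map>

    using llama_token = int32_t;

    static llama_token lookup_token(
            const std::unordered_map<std::string, llama_token> & token_to_id,
            const std::string & piece, llama_token unk_id) {
        const auto it = token_to_id.find(piece);
        return it == token_to_id.end() ? unk_id : it->second;
    }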

Models used:

closes #3730

@goerch (Collaborator) commented Oct 23, 2023

I'm testing refact and test-tokenizer-1-bpe is failing

    // TODO: why doesn't this work for the full range of Unicodes?
    // for (uint32_t cp = 0x10000; cp < 0x0010ffff; ++cp) {
    for (uint32_t cp = 0x10000; cp < 0x00080000; ++cp) {
[...]

for codepoint 0x40000. That doesn't look like a random problem. I'm not sure if we should only check the assigned Unicode planes here (my Unicode-fu is weak) or if this is a real problem.
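
For what it's worth, 0x40000 is the first codepoint of plane 4, and planes 4-13 currently have no assigned characters; assigned characters live in planes 0-3 and 14-16. A plane-based filter could look something like this (illustrative sketch, not an existing llama.cpp helper):

    // Illustrative sketch only (not an existing llama.cpp helper): true if
    // cp lies in a Unicode plane that has assigned characters, i.e. planes
    // 0-3 and 14-16; planes 4-13 are currently unassigned.
    #include <cstdint>

    static bool cp_in_assigned_plane(uint32_t cp) {
        const uint32_t plane = cp >> 16; // each plane spans 0x10000 codepoints
        return plane <= 3 || (plane >= 14 && plane <= 16);
    }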

@goerch (Collaborator) commented Oct 23, 2023

I'm not sure if we should only check the assigned Unicode planes here (my Unicode-fu is weak) or if this is a real problem.

Restricting test-tokenizer-1-bpe to the assigned Unicode planes like so

    // NOTE: only testing assigned Unicode planes
    // for (uint32_t cp = 0x10000; cp < 0x0010ffff; ++cp) {
    for (uint32_t cp = 0x10000; cp < 0x00040000; ++cp) {
        std::string str = codepoint_to_utf8(cp);
        std::vector<llama_token> tokens = llama_tokenize(ctx, str, false);
        std::string check = llama_detokenize_bpe(ctx, tokens);
        if (str != check) {
            fprintf(stderr, "%s : error: codepoint %x detokenizes to '%s'(%zu) instead of '%s'(%zu)\n",
                __func__, cp, check.c_str(), check.length(), str.c_str(), str.length());
            return 4;
        }
    }
    for (uint32_t cp = 0x000e0000; cp < 0x0010ffff; ++cp) {
        std::string str = codepoint_to_utf8(cp);
        std::vector<llama_token> tokens = llama_tokenize(ctx, str, false);
        std::string check = llama_detokenize_bpe(ctx, tokens);
        if (str != check) {
            fprintf(stderr, "%s : error: codepoint %x detokenizes to '%s'(%zu) instead of '%s'(%zu)\n",
                __func__, cp, check.c_str(), check.length(), str.c_str(), str.length());
            return 4;
        }
    }

fixes refact and works for all my models (I didn't check starcoder).
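
The round trip above hinges on codepoint_to_utf8; presumably that is a standard UTF-8 encoder, roughly along these lines (a sketch; the helper the test actually uses may differ in details):

    // Sketch of standard UTF-8 encoding of a single codepoint; the real
    // codepoint_to_utf8 used by test-tokenizer-1-bpe may differ in details.
    // (Ignores surrogate/invalid ranges; the loops above start past 0xFFFF.)
    #include <cstdint>
    #include <string>

    static std::string codepoint_to_utf8_sketch(uint32_t cp) {
        std::string s;
        if (cp <= 0x7f) {            // 1 byte:  0xxxxxxx
            s.push_back(static_cast<char>(cp));
        } else if (cp <= 0x7ff) {    // 2 bytes: 110xxxxx 10xxxxxx
            s.push_back(static_cast<char>(0xc0 | (cp >> 6)));
            s.push_back(static_cast<char>(0x80 | (cp & 0x3f)));
        } else if (cp <= 0xffff) {   // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            s.push_back(static_cast<char>(0xe0 | (cp >> 12)));
            s.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3f)));
            s.push_back(static_cast<char>(0x80 | (cp & 0x3f)));
        } else {                     // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            s.push_back(static_cast<char>(0xf0 | (cp >> 18)));
            s.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3f)));
            s.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3f)));
            s.push_back(static_cast<char>(0x80 | (cp & 0x3f)));
        }
        return s;
    }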

@Galunid (Collaborator, Author) commented Oct 23, 2023

After applying your change to the test, starcoder passes too:

$ make test
Running tests...
Test project /home/kris/llama.cpp/build
      Start  1: test-quantize-fns
 1/18 Test  #1: test-quantize-fns ................   Passed    0.02 sec
      Start  2: test-quantize-perf
 2/18 Test  #2: test-quantize-perf ...............   Passed    0.07 sec
      Start  3: test-sampling
 3/18 Test  #3: test-sampling ....................   Passed    0.01 sec
      Start  4: test-tokenizer-0-llama
 4/18 Test  #4: test-tokenizer-0-llama ...........   Passed    0.07 sec
      Start  5: test-tokenizer-0-falcon
 5/18 Test  #5: test-tokenizer-0-falcon ..........   Passed    0.29 sec
      Start  6: test-tokenizer-1-llama
 6/18 Test  #6: test-tokenizer-1-llama ...........   Passed    2.88 sec
      Start  7: test-tokenizer-1-baichuan
 7/18 Test  #7: test-tokenizer-1-baichuan ........   Passed    3.09 sec
      Start  8: test-tokenizer-1-falcon
 8/18 Test  #8: test-tokenizer-1-falcon ..........   Passed    1.88 sec
      Start  9: test-tokenizer-1-aquila
 9/18 Test  #9: test-tokenizer-1-aquila ..........   Passed    1.95 sec
      Start 10: test-tokenizer-1-mpt
10/18 Test #10: test-tokenizer-1-mpt .............   Passed    1.33 sec
      Start 11: test-tokenizer-1-bloom
11/18 Test #11: test-tokenizer-1-bloom ...........   Passed    4.56 sec
      Start 12: test-tokenizer-1-gpt-neox
12/18 Test #12: test-tokenizer-1-gpt-neox ........Subprocess aborted***Exception:   0.91 sec
      Start 13: test-tokenizer-1-refact
13/18 Test #13: test-tokenizer-1-refact ..........   Passed    1.24 sec
      Start 14: test-tokenizer-1-starcoder
14/18 Test #14: test-tokenizer-1-starcoder .......   Passed    1.26 sec
      Start 15: test-grammar-parser
15/18 Test #15: test-grammar-parser ..............   Passed    0.00 sec
      Start 16: test-llama-grammar
16/18 Test #16: test-llama-grammar ...............   Passed    0.00 sec
      Start 17: test-grad0
17/18 Test #17: test-grad0 .......................   Passed    4.41 sec
      Start 18: test-rope
18/18 Test #18: test-rope ........................   Passed    0.04 sec

94% tests passed, 1 tests failed out of 18

Total Test time (real) =  24.01 sec

@goerch (Collaborator) commented Oct 23, 2023

Start 12: test-tokenizer-1-gpt-neox
12/18 Test #12: test-tokenizer-1-gpt-neox ........Subprocess aborted***Exception: 0.91 sec

Interesting. I see

      Start 12: test-tokenizer-1-gpt-neox
12/17 Test #12: test-tokenizer-1-gpt-neox ........   Passed    1.87 sec

Hm.

@Galunid (Collaborator, Author) commented Oct 23, 2023

Pay no attention to neox; it's not relevant here (it uses an old model that was failing on the map illegal access).

@Galunid (Collaborator, Author) commented Oct 23, 2023

Should I also include your patch restricting the tokenizer tests to assigned Unicode planes in this PR?

@goerch (Collaborator) commented Oct 23, 2023

Should I also include your patch restricting the tokenizer tests to assigned Unicode planes in this PR?

Yes, please.

@goerch (Collaborator) left a review comment

Nicely done!

@goerch (Collaborator) commented Oct 23, 2023

@Galunid: sorry for mixing things up when using the GitHub online editor (I'll never try that again ;)). May I ask you to repair the mess? (I don't know enough about git to fix it on your branch.) Thanks again.

@Galunid force-pushed the add-tokenizer-tests branch from efd3c22 to 1244b00 on October 23, 2023 at 17:57
@Galunid (Collaborator, Author) commented Oct 23, 2023

@goerch Done, no worries ;)

@goerch (Collaborator) commented Oct 23, 2023

OK, I'll wait until tomorrow morning (+10 hours) before merging.

@goerch merged commit daab3d7 into ggml-org:master on Oct 24, 2023
@Galunid mentioned this pull request on Oct 24, 2023
mattgauf added a commit to mattgauf/llama.cpp that referenced this pull request on Oct 27, 2023:
* master: (350 commits)
  speculative : ensure draft and target model vocab matches (ggml-org#3812)
  llama : correctly report GGUFv3 format (ggml-org#3818)
  simple : fix batch handling (ggml-org#3803)
  cuda : improve text-generation and batched decoding performance (ggml-org#3776)
  server : do not release slot on image input (ggml-org#3798)
  batched-bench : print params at start
  log : disable pid in log filenames
  server : add parameter -tb N, --threads-batch N (ggml-org#3584) (ggml-org#3768)
  server : do not block system prompt update (ggml-org#3767)
  sync : ggml (conv ops + cuda MSVC fixes) (ggml-org#3765)
  cmake : add missed dependencies (ggml-org#3763)
  cuda : add batched cuBLAS GEMM for faster attention (ggml-org#3749)
  Add more tokenizer tests (ggml-org#3742)
  metal : handle ggml_scale for n%4 != 0 (close ggml-org#3754)
  Revert "make : add optional CUDA_NATIVE_ARCH (ggml-org#2482)"
  issues : separate bug and enhancement template + no default title (ggml-org#3748)
  Update special token handling in conversion scripts for gpt2 derived tokenizers (ggml-org#3746)
  llama : remove token functions with `context` args in favor of `model` (ggml-org#3720)
  Fix baichuan convert script not detecing model (ggml-org#3739)
  make : add optional CUDA_NATIVE_ARCH (ggml-org#2482)
  ...
Successfully merging this pull request may close these issues: Missing tokenizer tests (#3730)

3 participants