Benchmark for tokenizers.
A utility to compare the performance of different tokenizers across different models and datasets.
Clone the repository and install the dependencies. Python and a package manager of choice must be installed; examples are given for pdm and rye.
git clone https://github.com/Systemcluster/tokenizer-bench
cd tokenizer-bench
pdm install # or rye sync
This will install a common subset of tokenizers. To install all tokenizers, install the optional dependencies.
pdm install -G :all # or rye sync --all-features
Note: Commands given in this section omit pdm run, rye run, or your command of choice for running Python in a virtual environment.
Run the benchmark with the following command:
python -m bench
This will run the benchmark with all tokenizers, models, and datasets, excluding combinations that are known not to complete in reasonable time. The results will be saved to the timings directory unless otherwise specified with the --timings-dir option.
Specific tokenizers, models, or datasets can be selected with the --tokenizers, --models, and --datasets options. Each option accepts a comma-separated list of names. For example, to run the benchmark with only the kitoken and sentencepiece tokenizers and the wagahai dataset, run:
python -m bench --tokenizers kitoken,sentencepiece --datasets wagahai
To exclude known slow combinations, use the --skip-slow option. This will exclude tokenizer, model, and dataset combinations that are known to take a long time to complete. To include combinations that are known not to complete at all, use the --allow-inf option.
Run python -m bench --help for a full list of options.
To show the results of a previous benchmark run, run the following command:
python -m bench --show-results
The --show-results argument can be combined with the options described above to select specific tokenizers, models, and datasets. Use the --timings-dir option to specify the directory containing the results to show, and the --compare-dir option to compare against another set of results.
To generate encoding and decoding results for tokenizer implementations, run the following command:
python -m generate
This will encode and decode the test data with all tokenizers in the models directory and save the results in the outputs directory.
This benchmark measures the time it takes different tokenizers to encode inputs with different models and datasets. To ensure consistent timings, each tokenizer is, for every configuration, initialized and run in a separate subprocess with garbage collection disabled, and each subprocess is run in high-priority mode when started with appropriate permissions.
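The isolation described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual harness: the child snippet and the stand-in encode function are assumptions made for the example.

```python
import subprocess
import sys

# Child snippet run in a fresh interpreter: disables garbage collection so
# GC pauses don't skew timings, then times a stand-in "encode" function.
CHILD = """
import gc, time
gc.disable()  # avoid GC pauses during measurement
encode = lambda text: text.encode("utf-8")  # stand-in for a real tokenizer
start = time.perf_counter()
for _ in range(1000):
    encode("hello world")
print(time.perf_counter() - start)
"""

def run_isolated() -> float:
    # Each configuration runs in its own subprocess so state from one
    # run cannot leak into the next.
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        capture_output=True, text=True, check=True,
    )
    return float(result.stdout)

elapsed = run_isolated()
```

Running each configuration in a separate interpreter also means a crash or hang in one tokenizer cannot affect the timings of the others.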
Benchmarks are run with a fixed number of iterations. The iteration counts are chosen to be large enough to give stable results, but small enough to complete in a reasonable amount of time.
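A fixed-iteration timing loop of this kind might look like the sketch below; the warmup phase and the specific counts are illustrative assumptions, since the benchmark chooses its own iteration counts per configuration.

```python
import time

def time_fixed(fn, data, iterations=100, warmup=10):
    # Warm up caches before measuring.
    for _ in range(warmup):
        fn(data)
    # Time a fixed number of iterations and report the mean per call.
    start = time.perf_counter()
    for _ in range(iterations):
        fn(data)
    return (time.perf_counter() - start) / iterations

avg = time_fixed(str.encode, "some input text")
```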
The full benchmark takes a long time to complete. Use the options described above to run a subset of the benchmark. Combinations that are known not to complete in a reasonable amount of time are excluded by default unless the --allow-inf option is set.
A selection of results is published in the Kitoken repository.
The following tokenizers are included in this benchmark:
- Kitoken
- SentencePiece 1
- Tokenizers
- Tiktoken 2
  - Tekken (uses Tiktoken internally)
- gpt_bpe 1
- llama.cpp 2 (optional, very slow)

1: doesn't complete the UTF-8 Sequence dataset with some models
2: doesn't complete the UTF-8 Sequence dataset with any model unless the timeout is raised
This benchmark does not verify the correctness of tokenization results. Some tokenizers targeting the same model as others might produce different results for the same inputs. Use the test data generation utility to compare outputs manually if needed.
To add a new tokenizer, copy one of the existing modules in the bench/benches directory and change the copy to run it. Afterwards, add it to the tokenizers dict in bench/utils/bench.py.
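The general shape of such a module might look like the sketch below. This is illustrative only: DummyTokenizer and the callable layout are assumptions, not the repository's actual module interface, so consult the existing modules in bench/benches for the real conventions.

```python
# Toy tokenizer standing in for a real library binding.
class DummyTokenizer:
    def encode(self, text: str) -> list[int]:
        return [ord(c) for c in text]  # illustrative "tokenization"

def make_bench(model_path: str):
    # A real module would load the model file here.
    tokenizer = DummyTokenizer()
    def run(data: str) -> list[int]:
        return tokenizer.encode(data)
    return run

run = make_bench("models/example.model")
tokens = run("abc")  # → [97, 98, 99]
```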
See the models directory for the list of models included in this benchmark.
- Pride and Prejudice: A text document containing Pride and Prejudice by Jane Austen. This data is a good representation for common English-language inputs containing a mix of short and long paragraphs.
- UTF-8 Sequence: A text document containing a single-line UTF-8 sequence. This data is a good representation of inputs that might fail to split during pre-tokenization.
- Wagahai: A text document containing Wagahai wa Neko de Aru by Natsume Sōseki. This data is a good representation for Japanese-language inputs containing many long paragraphs.
- Small Input: A text document containing a variety of short paragraphs.
- UTF-8 Input: A text document containing a mix of different languages and various unicode characters.
- Mixed Input: A large text document containing a shuffled mix of unicode characters.
See the data directory for the contents of each dataset.