Add FxHash and ShortStringOptimization. #1733

Open · wants to merge 12 commits into base: main
Conversation

@MeetThePatel commented Feb 10, 2025

This PR contains performance optimizations (spreadsheet of benchmark comparisons linked below). The two main optimizations are:

  • Switching to FxHash in place of the standard library's default hasher. FxHash is provided by the rustc_hash crate and is the hasher the Rust compiler itself uses internally.

  • Switching from String to CompactString, provided by the compact_str crate, which stores strings of up to 24 bytes inline (short-string optimization); many tokens fit within that limit.

Progress:

This PR is not fully complete. At this point in time, the following tasks have been completed:

  • Convert base crate.
    • Tests are passing, and benchmarks were run (linked below).
  • Convert Python bindings.
    • The same tests that pass on HEAD pass on this branch. There seem to be 2 failing tests on HEAD, but this branch fails the exact same tests in the same manner (i.e., identical output).
    • There seems to be an issue with my use of pyo3. The benchmarks for the base crate outperform HEAD by a decent margin, but the benches/test_tiktoken.py results are basically the same as HEAD (within the margin of error). I suspect unnecessary type conversions are the cause.
  • Convert Node bindings.
  • Cleanup.

Benchmarks:

The benchmarks are at the following link: Tokenizer Benchmarks. All benchmarks were run on a MacBook Pro with an M2 Pro chip and 16 GB of memory.

Additionally, the BPE Train vocabulary (huge) benchmark is one I added (not committed, as that would warrant its own PR). It uses the One Billion Word Challenge dataset, which clocks in at 4.15 GB.

@ArthurZucker (Collaborator)

Sounds good! Node bindings are really not necessary!

@ArthurZucker (Collaborator)

Do you want this to be reviewed now?

@MeetThePatel (Author)

Not yet. I'm still working on a few things:

  • See whether there actually is a bottleneck in the Python-Rust interface (the suspected extra type conversions), or whether the benchmarks simply aren't pushing the crate hard enough to show a meaningful difference. It should be ~10-15% faster (according to the base crate BPE encode benchmark).
  • Clean up code quality.

@MeetThePatel (Author)

I think this is ready for review.

Also, these are the distributions for the benchmark runs (blue is this PR, red is HEAD). Besides being a bit faster, the timings also seem more consistent (at least on my machine).

HEAD vs PR base crate benchmarks.pdf

@MeetThePatel MeetThePatel marked this pull request as ready for review February 11, 2025 22:55