Add FxHash and ShortStringOptimization. #1733

Open · wants to merge 12 commits into base: main
Conversation

@MeetThePatel commented Feb 10, 2025

This PR contains performance optimizations (spreadsheet of benchmark comparisons linked below). The two main optimizations are:

  • Switching to FxHash in place of the standard library's default hasher. FxHash is provided by the rustc_hash crate and is the hasher the Rust compiler itself uses internally.

  • Switching from String to CompactString, provided by the compact_str crate, which stores strings of up to 24 bytes inline (short-string optimization); many tokens fit within that limit.

Progress:

This PR is not fully complete. At this point in time, the following tasks have been completed:

  • Convert base crate.
    • Tests are passing, and benchmarks were run (linked below).
  • Convert Python bindings.
    • The same tests that pass on HEAD pass on this branch. There seem to be 2 failing tests on HEAD, but this branch fails the exact same tests in the same manner (i.e., identical output).
    • There seems to be an issue with my use of pyo3. The benchmarks for the base crate outperform HEAD by a decent margin, but the benches/test_tiktoken.py results are basically the same as HEAD (within the margin of error). I suspect unnecessary type conversions are the cause.
  • Convert Node bindings.
  • Cleanup.

Benchmarks:

The benchmarks are at the following link: Tokenizer Benchmarks. All benchmarks were run on a MacBook Pro with an M2 Pro chip and 16 GB of memory.

Additionally, the BPE Train vocabulary (huge) benchmark is one I added (not committed, as that would warrant its own PR). It uses the One Billion Word Challenge dataset, which clocks in at 4.15 GB.

@ArthurZucker (Collaborator)

Sounds good! Node bindings are really not necessary!

@ArthurZucker (Collaborator)

Do you want this to be reviewed now?

@MeetThePatel (Author)

Not yet. I'm still working on a few things:

  • See whether there actually is a bottleneck in the Python-Rust interface (the suspected extra type conversions), or whether the benchmarks simply aren't pushing the crate hard enough to show a meaningful difference. It should be ~10-15% faster (according to the base crate BPE encode benchmark).
  • Clean up code quality.

@MeetThePatel (Author)

I think this is ready for review.

Also, these are the distributions for the benchmark runs (blue is this PR, red is HEAD). Besides being a bit faster, the timings also seem more consistent (at least on my machine).

HEAD vs PR base crate benchmarks.pdf

@MeetThePatel MeetThePatel marked this pull request as ready for review February 11, 2025 22:55