Tokenizer 1.20.0

@guillaumekln released this 24 Sep 08:49

Changes

  • The following changes affect users who compile the project from source. They ensure that users get the best performance and all features by default:
    • ICU is now a required dependency, which improves performance and Unicode support
    • SentencePiece is now integrated as a Git submodule and linked statically to the project
    • Boost is no longer required; the project now uses cxxopts, which is integrated as a Git submodule
    • The project is compiled in Release mode by default
    • Tests are no longer compiled by default (use -DBUILD_TESTS=ON to compile the tests)

New features

  • Accept any Unicode script alias in the segment_alphabet option
  • Update SentencePiece to 0.1.92
  • [Python] Improve the capabilities of the Token class:
    • Implement the __repr__ method
    • Allow setting all attributes in the constructor
    • Add a copy constructor
  • [Python] Add a copy constructor for the Tokenizer class
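
The Python additions above could be exercised roughly as follows. This is a minimal sketch, not taken from the release itself: it assumes the pyonmttok package (1.20.0 or later) is installed, that Token attributes such as surface and join_right can be passed as keyword arguments, and that both copy constructors take the existing instance as their only argument. The helper names token_copy_demo and tokenizer_copy_demo are hypothetical.

```python
import importlib.util


def token_copy_demo():
    """Build a Token from keyword attributes, then copy it."""
    import pyonmttok  # third-party; assumed installed at 1.20.0 or later

    # Attributes set directly in the constructor.
    original = pyonmttok.Token(surface="Hello", join_right=True)
    # Assumption: the copy constructor takes the source token as its argument.
    duplicate = pyonmttok.Token(original)
    duplicate.surface = "World"  # modifies the copy only
    return original, duplicate


def tokenizer_copy_demo():
    """Copy a Tokenizer; the copy is expected to keep the same options."""
    import pyonmttok

    tokenizer = pyonmttok.Tokenizer("conservative", joiner_annotate=True)
    # Assumption: the copy constructor takes the source tokenizer as its argument.
    clone = pyonmttok.Tokenizer(tokenizer)
    return tokenizer, clone


# Run the demo only when pyonmttok is available.
if importlib.util.find_spec("pyonmttok") is not None:
    original, duplicate = token_copy_demo()
    print(repr(original))   # uses the new __repr__ method
    print(repr(duplicate))
    tokenizer, clone = tokenizer_copy_demo()
    print(tokenizer.tokenize("Hello World!")[0] == clone.tokenize("Hello World!")[0])
```

The exact text printed by __repr__ is not shown here because it depends on the installed version.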

Fixes and improvements

  • [Python] Accept a None value for the segment_alphabet argument
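
With this fix, an optional setting can presumably be forwarded as is. The sketch below assumes pyonmttok 1.20.0 or later is installed; build_tokenizer is a hypothetical helper, and "Han" is used only as an example of a Unicode script name.

```python
import importlib.util


def build_tokenizer(segment_alphabet=None):
    """Create a tokenizer, forwarding segment_alphabet even when it is None."""
    import pyonmttok  # third-party; assumed installed at 1.20.0 or later

    return pyonmttok.Tokenizer("conservative", segment_alphabet=segment_alphabet)


# Run the demo only when pyonmttok is available.
if importlib.util.find_spec("pyonmttok") is not None:
    # None means that no alphabet gets character-level segmentation.
    tokens, _ = build_tokenizer().tokenize("Hello")
    print(tokens)
    # A list of script names enables segmentation for those scripts.
    tokens, _ = build_tokenizer(["Han"]).tokenize("測試 abc")
    print(tokens)
```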