Releases: google/sentencepiece

v0.1.92

08 Jun 09:05

Bug fix

  • Fixed the regression bug around the flag --minloglevel
  • Fixed build break on Solaris.

Minor upgrade

  • Upgraded builtin protobuf to 3.12.3.
  • Implemented absl::flags port.

v0.1.91

21 May 03:25
a32d7dc

New API

Bug Fix

  • Ignores nbest parameter in BPE-dropout
  • Fixed build error when SPM_ENABLE_NFKC_COMPILE=ON.
  • Fixed the cost computation around user_defined_symbol and the faster encoding introduced in the previous release.

v0.1.90

13 May 06:20

Renamed v0.1.9 to v0.1.90 because PyPI doesn't recognize 0.1.9 as the latest release.
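The rename makes sense once version components are compared as integers rather than as text: 9 is smaller than 86, so 0.1.9 sorts below the earlier 0.1.86. A minimal sketch of that ordering follows. The helper `release_key` is hypothetical and handles only plain dotted numeric versions; it is a simplification, not PyPI's actual comparison logic.

```python
def release_key(version: str) -> tuple:
    """Turn a dotted version such as 'v0.1.86' into (0, 1, 86).

    Hypothetical helper: handles only plain numeric components.
    """
    return tuple(int(part) for part in version.lstrip("v").split("."))


# Components compare as integers, so 9 < 86 < 90.
print(max(["0.1.9", "0.1.86"], key=release_key))            # 0.1.86
print(max(["0.1.9", "0.1.86", "0.1.90"], key=release_key))  # 0.1.90
```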

v0.1.9

13 May 02:52
6f58436

Features:

  • --byte_fallback: Falls back to UTF-8 byte sequences for UNK tokens. 256 byte symbols are reserved in advance.
    https://arxiv.org/pdf/1909.03341.pdf Note that you need to set --character_coverage to less than 1.0; otherwise byte-fallback tokens may not appear in the training data.
  • BPE-dropout: Implemented BPE dropout. https://arxiv.org/abs/1910.13267
    Sampling API is available for the BPE.
    https://github.com/google/sentencepiece/blob/master/src/sentencepiece_processor.h#L287
  • --required_chars=chars: Specify the set of Unicode chars that must be included in the final vocab.
  • --split_digits: Split all digits (0-9) into separate pieces (disabled by default)
  • Denormalization: Applies an extra normalization rule after decoding. The rule can be specified as a TSV file via the --denormalization_rule_tsv=file flag. Note that offset information may not always be preserved.
  • --train_extremely_large_corpus: Trains the unigram model from an extremely large corpus (> 10M sentences) while avoiding integer overflow. Note that this increases memory usage; 300GB or more of memory might be necessary.
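
To illustrate the idea behind --byte_fallback, here is a toy, character-level sketch: any character missing from the vocabulary is replaced by one piece per UTF-8 byte instead of a single UNK. The function `encode_with_byte_fallback`, the tiny vocabulary, and the `<0xXX>` piece spelling are assumptions made for illustration. This is not the library's segmentation algorithm or API.

```python
def encode_with_byte_fallback(text, vocab):
    """Toy illustration: keep known characters, spell unknown ones as bytes.

    Hypothetical helper; real models segment into subword pieces, not
    single characters.
    """
    pieces = []
    for ch in text:
        if ch in vocab:
            pieces.append(ch)
        else:
            # One piece per UTF-8 byte, so at most 256 distinct byte symbols.
            pieces.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return pieces


vocab = set("helo ")  # assumed toy vocabulary that lacks 'é'
print(encode_with_byte_fallback("hello é", vocab))
# ['h', 'e', 'l', 'l', 'o', ' ', '<0xC3>', '<0xA9>']
```

Because every string can be written as UTF-8 bytes, reserving 256 byte symbols is enough to represent any input without producing UNK.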

Performance improvement:

  • 30%-50% performance improvement is obtained in the default unigram one-best tokenization.

New API

v0.1.86

24 Apr 09:34
e8a84a1
  • Support tf 1.5.1, 2.0.0, 2.0.1, 2.1.0, and 2.2.0rc3
  • Added python wrapper for Python3.8 on Mac

v0.1.85

15 Dec 15:39

Support tf 1.15 and Python3.8 on Windows

v0.1.84

12 Oct 09:01
  • Support tf 2.0.0

v0.1.83

16 Aug 15:09
17568d0
  • Use the official docker image to build tf_sentencepiece ops
  • Support tf 1.14.0 and tf 2.0.0-beta1.

Sentencepiece re-release

24 Jun 02:55
Pre-release

Releases a new version of Sentencepiece with major refactorings:

  • Builds with Bazel
  • Re-uses existing open source libraries whenever possible
  • Refactors internal dependencies
  • New sets of features for configuring tokenizers
  • Separation from Tensorflow

v0.1.82

13 Apr 16:36

Bug fix: fixed the behavior of is_unknown method in Python module.