v0.1.9

@taku910 taku910 released this 13 May 02:52
6f58436

Features:

  • --byte_fallback: fall back unknown (UNK) tokens to UTF-8 byte sequences. 256 byte symbols are reserved in advance.
    https://arxiv.org/pdf/1909.03341.pdf Note that you need to set --character_coverage to less than 1.0, otherwise byte-fallback tokens may not appear in the training data.
  • BPE-dropout: Implemented BPE-dropout. https://arxiv.org/abs/1910.13267
    The sampling API is now also available for the BPE model.
    https://github.com/google/sentencepiece/blob/master/src/sentencepiece_processor.h#L287
  • --required_chars=chars: Specify the set of Unicode chars that must be included in the final vocab.
  • --split_digits: Split all digits (0-9) into separate pieces (disabled by default).
  • Denormalization: Apply an extra normalization rule after decoding. The rule is specified as a TSV file via the --denormalization_rule_tsv=file flag. Note that offset information may not always be preserved.
  • --train_extremely_large_corpus: Train the unigram model from an extremely large corpus (> 10M sentences) without integer overflow. Note that this increases memory usage; 300GB or more of RAM may be necessary.
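The new training flags above can be combined in a single spm_train invocation. The sketch below is a hypothetical example, not from this release: the corpus path, model prefix, vocabulary size, and the coverage/required-character values are placeholders chosen for illustration.

```shell
# Hypothetical training run exercising the new flags.
# corpus.txt, vocab_size, and the flag values are illustrative placeholders.
spm_train \
  --input=corpus.txt \
  --model_prefix=m \
  --vocab_size=8000 \
  --byte_fallback=true \
  --character_coverage=0.9995 \  # must be < 1.0 so byte-fallback tokens appear
  --split_digits=true \
  --required_chars="abc"         # chars that must be in the final vocab
```

For corpora larger than ~10M sentences, add --train_extremely_large_corpus=true (unigram model only), keeping the memory caveat above in mind.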

Performance improvements:

  • A 30%-50% speedup in the default unigram one-best tokenization.

New API