Features:
--byte_fallback: Fall back UNK tokens to UTF-8 byte sequences. 256 byte symbols are reserved in advance. https://arxiv.org/pdf/1909.03341.pdf Note that you need to set --character_coverage to less than 1.0; otherwise, byte-fallback tokens may not appear in the training data (see the usage sketch after this feature list).
--required_chars=chars: Specify the set of Unicode characters that must be included in the final vocabulary.
--split_digits: Split all digits (0-9) into separate pieces (disabled by default).
Denormalization: Apply an extra normalization rule after decoding. The rule can be specified as a TSV file via the --denormalization_rule_tsv=file flag. Note that offset information may not always be preserved.
--train_extremely_large_corpus: Train the unigram model from an extremely large corpus (>10M sentences) without integer overflow. Note that this increases memory usage; 300GB or more of memory may be necessary.
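
The following is a minimal sketch of how the new trainer flags can be combined, using the Python bindings. The corpus filename, model prefix, and vocabulary size are illustrative assumptions, not values from this release.

```python
import sentencepiece as spm

# Train with the new flags. "corpus.txt", the model prefix, and the
# vocab size below are illustrative assumptions.
spm.SentencePieceTrainer.Train(
    "--input=corpus.txt "            # one sentence per line (assumed filename)
    "--model_prefix=m "
    "--vocab_size=8000 "
    "--model_type=unigram "
    "--byte_fallback=true "          # fall back UNK to UTF-8 byte pieces
    "--character_coverage=0.9995 "   # must be < 1.0 when byte_fallback is on
    "--split_digits=true "           # split 0-9 into separate pieces
    "--required_chars=0123456789"    # chars guaranteed to be in the vocab
)

sp = spm.SentencePieceProcessor()
sp.Load("m.model")
# A character unseen during training is now covered by byte pieces
# instead of being mapped to <unk>.
print(sp.EncodeAsPieces("héllo"))
```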
Performance improvement:
The default unigram one-best tokenization is 30%-50% faster.