
NaN unigram model score error with sentencepiece 0.1.98 #851

Closed
lucaslingle opened this issue Apr 14, 2023 · 3 comments

@lucaslingle

On a clean Ubuntu machine with sentencepiece 0.1.98 installed via pip, I get NaN scores when training a unigram model.

For example, the following script fails, although it worked with version 0.1.97.

import tempfile
import tensorflow_datasets as tfds
import sentencepiece as spm

def dump_chars_to_tempfile(ds, maxchars):
    """Write up to maxchars characters from the dataset to a temp file."""
    char_count = 0
    with tempfile.NamedTemporaryFile(delete=False, prefix="/tmp/ds_chars") as outfp:
        for document_chars in ds:
            if char_count >= maxchars:
                break
            outfp.write(document_chars + b" ")
            char_count += len(document_chars)
        return outfp.name, char_count

# Dump ~1e8 characters of wiki40b/en text to a temp file for training.
chardump_ds = tfds.load("wiki40b/en:1.3.0", split="train").map(lambda r: r["text"]).as_numpy_iterator()
fname, _ = dump_chars_to_tempfile(ds=chardump_ds, maxchars=int(1e8))

temp_fp = tempfile.NamedTemporaryFile(delete=False, prefix="/tmp/sp_tmp")
spm.SentencePieceTrainer.Train(
    input=fname,
    vocab_size=32000,
    character_coverage=1.0,
    model_prefix=temp_fp.name,
    model_type="unigram",
    user_defined_symbols=[],
    pad_id=0,
    bos_id=-1,  # disable bos id
    eos_id=1,
    unk_id=2,
)

The stack trace is:

sentencepiece_trainer.cc(77) LOG(INFO) Starts training with : 
trainer_spec {
  input: /tmp/ds_charsrpp7ukvr
  input_format: 
  model_prefix: /tmp/sp_tmphrtwan9z
  model_type: UNIGRAM
  vocab_size: 32000
  self_test_sample_size: 0
  character_coverage: 1
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 0
  pretokenization_delimiter: 
  treat_whitespace_as_suffix: 0
  allow_whitespace_only_pieces: 0
  required_chars: 
  byte_fallback: 0
  vocabulary_output_piece_score: 1
  train_extremely_large_corpus: 0
  hard_vocab_limit: 1
  use_all_vocab: 0
  unk_id: 2
  bos_id: -1
  eos_id: 1
  pad_id: 0
  unk_piece: <unk>
  bos_piece: <s>
  eos_piece: </s>
  pad_piece: <pad>
  unk_surface:  ⁇ 
  enable_differential_privacy: 0
  differential_privacy_noise_level: 0
  differential_privacy_clipping_threshold: 0
}
normalizer_spec {
  name: nmt_nfkc
  add_dummy_prefix: 1
  remove_extra_whitespaces: 1
  escape_whitespaces: 1
  normalization_rule_tsv: 
}
denormalizer_spec {}
trainer_interface.cc(351) LOG(INFO) SentenceIterator is not specified. Using MultiFileSentenceIterator.
trainer_interface.cc(183) LOG(INFO) Loading corpus: /tmp/ds_charsrpp7ukvr
trainer_interface.cc(378) LOG(WARNING) Found too long line (4536 > 4192).
trainer_interface.cc(380) LOG(WARNING) Too long lines are skipped in the training.
trainer_interface.cc(381) LOG(WARNING) The maximum length can be changed with --max_sentence_length=<size> flag.
trainer_interface.cc(407) LOG(INFO) Loaded all 425807 sentences
trainer_interface.cc(414) LOG(INFO) Skipped 1935 too long sentences.
trainer_interface.cc(423) LOG(INFO) Adding meta_piece: <pad>
trainer_interface.cc(423) LOG(INFO) Adding meta_piece: </s>
trainer_interface.cc(423) LOG(INFO) Adding meta_piece: <unk>
trainer_interface.cc(428) LOG(INFO) Normalizing sentences...
trainer_interface.cc(537) LOG(INFO) all chars count=88479119
trainer_interface.cc(548) LOG(INFO) Done: 100% characters are covered.
trainer_interface.cc(558) LOG(INFO) Alphabet size=3623
trainer_interface.cc(559) LOG(INFO) Final character coverage=1
trainer_interface.cc(591) LOG(INFO) Done! preprocessed 425801 sentences.
unigram_model_trainer.cc(247) LOG(INFO) Making suffix array...
unigram_model_trainer.cc(251) LOG(INFO) Extracting frequent sub strings... node_num=43616444
unigram_model_trainer.cc(301) LOG(INFO) Initialized 577125 seed sentencepieces
unigram_model_trainer.cc(150) [!std::isnan(score)] 
Program terminated with an unrecoverable error.

I thought the developers would want to know. I will use version 0.1.97 in the meantime. Thank you!

@taku910
Collaborator

taku910 commented Apr 15, 2023

Thank you. It seems that the seed vocabulary has an extremely large score. This could be critical, so we will fix it soon. Thank you for the report.
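As a toy illustration of how an out-of-range seed score can surface as NaN (the exact mechanism inside `unigram_model_trainer.cc` is an assumption here, not taken from the source): once a score overflows to infinity, ordinary log-space normalization produces `inf - inf`, which is NaN, and that is exactly what the `[!std::isnan(score)]` assertion guards against.

```python
import math

# If a seed piece's score overflows to +inf, subtracting an (also
# infinite) log-normalizer yields inf - inf, which is NaN.
score = float("inf")
log_normalizer = float("inf")
normalized = score - log_normalizer
print(math.isnan(normalized))  # True
```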

@taku910 taku910 self-assigned this Apr 20, 2023
@taku910 taku910 added the bug label Apr 20, 2023
@taku910
Copy link
Collaborator

taku910 commented May 2, 2023

@ngan-nt

ngan-nt commented Jul 7, 2023

Hi, I still have this problem even though I installed version 0.1.99. Thank you!

trainer_interface.cc(522) LOG(INFO) Found null character. The corpus must be encoded in utf-8.                                                                                           
trainer_interface.cc(537) LOG(INFO) all chars count=2194048217
trainer_interface.cc(548) LOG(INFO) Done: 99.99% characters are covered.
trainer_interface.cc(558) LOG(INFO) Alphabet size=3063
trainer_interface.cc(559) LOG(INFO) Final character coverage=0.9999
trainer_interface.cc(591) LOG(INFO) Done! preprocessed 967719 sentences.
unigram_model_trainer.cc(222) LOG(INFO) Making suffix array...
unigram_model_trainer.cc(226) LOG(INFO) Extracting frequent sub strings... node_num=1237217680                                                                                           
unigram_model_trainer.cc(274) LOG(INFO) Initialized 1003063 seed sentencepieces
trainer_interface.cc(597) LOG(INFO) Tokenizing input sentences with whitespace: 967719
trainer_interface.cc(608) LOG(INFO) Done! 40765528
unigram_model_trainer.cc(564) LOG(INFO) Using 40765528 sentences for EM training
unigram_model_trainer.cc(580) LOG(INFO) EM sub_iter=0 size=702716 obj=59.4182 num_tokens=479275856 num_tokens/piece=682.034                                                              
unigram_model_trainer.cc(580) LOG(INFO) EM sub_iter=1 size=530593 obj=80.5757 num_tokens=480490361 num_tokens/piece=905.572                                                              
unigram_model_trainer.cc(580) LOG(INFO) EM sub_iter=0 size=363670 obj=58.2552 num_tokens=502565215 num_tokens/piece=1381.93                                                              
unigram_model_trainer.cc(580) LOG(INFO) EM sub_iter=1 size=357052 obj=62.4634 num_tokens=519818158 num_tokens/piece=1455.86                                                              
unigram_model_trainer.cc(580) LOG(INFO) EM sub_iter=0 size=267764 obj=58.906 num_tokens=510585662 num_tokens/piece=1906.85                                                               
unigram_model_trainer.cc(125) [!std::isnan(score)]
Program terminated with an unrecoverable error.
