duplicate tokens in user_defined_symbols param cause RuntimeError #811

Closed
poedator opened this issue Jan 26, 2023 · 2 comments

poedator commented Jan 26, 2023

When training with the user_defined_symbols parameter, I ran into this error:
RuntimeError: Internal: src/trainer_interface.cc(717) [insert_meta_symbol(w, ModelProto::SentencePiece::USER_DEFINED)]
The message was not very informative. Fortunately, I soon realized that it was caused by duplicate symbols in the user_defined_symbols list.
You may want to make this error message more explicit and/or add a separate check for non-unique symbols.

To reproduce, run code like the following, with duplicate symbols in the user_defined_symbols list:

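```python
import sentencepiece as spm

# 'LOL' is listed twice in user_defined_symbols, which triggers the error above.
spm.SentencePieceTrainer.Train(input='manywordswithfreq.tsv',
                               user_defined_symbols=['LOL', 'LOL', 'KEK'],
                               input_format='tsv',
                               hard_vocab_limit=False,
                               model_prefix='m',
                               vocab_size=10000)
```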
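
Until the library reports duplicates itself, the list can be checked on the caller's side before training. The following is a minimal sketch of such a check, not part of SentencePiece: `find_duplicates` and `dedupe` are hypothetical helper names introduced here for illustration, and the sketch assumes exact string equality is the right notion of "duplicate".

```python
from collections import Counter


def find_duplicates(symbols):
    """Return the symbols that occur more than once, in first-seen order.

    Hypothetical helper, not a SentencePiece API.
    """
    counts = Counter(symbols)
    return [s for s in counts if counts[s] > 1]


def dedupe(symbols):
    """Drop repeated symbols while keeping the original order.

    Hypothetical helper, not a SentencePiece API.
    """
    return list(dict.fromkeys(symbols))


symbols = ['LOL', 'LOL', 'KEK']
duplicates = find_duplicates(symbols)
unique_symbols = dedupe(symbols)

print(duplicates)      # ['LOL']
print(unique_symbols)  # ['LOL', 'KEK']
```

Passing `unique_symbols` as `user_defined_symbols` in the call above should avoid the duplicate-symbol failure; alternatively, raising an error when `duplicates` is non-empty surfaces the problem earlier with a clearer message.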

taku910 (Collaborator) commented Jan 31, 2023

Thank you for the report. We will make the error message more descriptive in the next release.

taku910 (Collaborator) commented Apr 12, 2023

Fixed in v0.1.98

taku910 closed this as completed Apr 12, 2023