different sizes of dictionaries in different models #85

Open
bariluz93 opened this issue Nov 27, 2022 · 1 comment
Comments

@bariluz93

Hi,
I use different tokenizers for different languages:

Helsinki-NLP/opus-mt-en-de
Helsinki-NLP/opus-mt-en-he
Helsinki-NLP/opus-mt-en-ru
Helsinki-NLP/opus-mt-en-es

I see that the English parts of the dictionaries differ between these models.
For example,
tokenizer_he.tokenize("housekeeper") outputs
['▁housekeeper']
while
tokenizer_es.tokenize("housekeeper") outputs
['▁house', 'keeper']

What is the reason for this difference?
Were the tokenizers trained on different datasets?
Thank you
Bar

@jorgtied
Member

Yes. Each model has its own SentencePiece models, one trained on each side of the bitext used to train that model.
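
To illustrate why separately trained subword models can split the same English word differently, here is a minimal sketch. It is a toy BPE-style merge learner in plain Python, not SentencePiece itself and not the OPUS-MT training setup. The function names (`merge_pair`, `learn_merges`, `segment`) and the two tiny corpora are made up for the example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Merge every non-overlapping occurrence of `pair` in a symbol tuple."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_merges(corpus, num_merges):
    """Learn up to `num_merges` merges from a whitespace-split corpus."""
    words = Counter(corpus.split())
    # Each word starts as single characters, with "▁" marking the word start.
    vocab = {tuple("▁" + w): c for w, c in words.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair; ties go to the lexicographically smallest pair.
        best = max(sorted(pairs), key=pairs.get)
        merges.append(best)
        vocab = {merge_pair(s, best): c for s, c in vocab.items()}
    return merges


def segment(word, merges):
    """Segment a word by replaying the learned merges in order."""
    symbols = tuple("▁" + word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical English sides of two different bitexts.
corpus_a = "housekeeper " * 10    # the full word is frequent here
corpus_b = "house keeper " * 10   # only the two parts occur here

merges_a = learn_merges(corpus_a, num_merges=20)
merges_b = learn_merges(corpus_b, num_merges=20)

seg_a = segment("housekeeper", merges_a)
seg_b = segment("housekeeper", merges_b)
print(seg_a)  # a single piece: ['▁housekeeper']
print(seg_b)  # several pieces, starting with '▁house'
```

The learner trained on `corpus_a` has seen "housekeeper" as a whole word and keeps it as one piece. The one trained on `corpus_b` has never seen "e" followed by "k" inside a word, so it cannot join the two halves. Real models differ in the same way because each English-side vocabulary is learned from whatever English text that language pair's training data happens to contain.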
