
Wrong tokenizer/vocab for the 'Helsinki-NLP/opus-mt-tc-big-en-ko' model #81

regpath opened this issue Sep 13, 2022 · 0 comments

regpath commented Sep 13, 2022

The translation result from English to Korean using the 'Helsinki-NLP/opus-mt-tc-big-en-ko' model does not make sense at all:

from transformers import MarianMTModel, MarianTokenizer

# Local checkpoint of 'Helsinki-NLP/opus-mt-tc-big-en-ko'
MODEL_PATH3 = "Helsinki-NLP/opus-mt-tc-big-en-ko"

src_text = [
    "2, 4, 6 etc. are even numbers.",
    "Yes."
]

tokenizer = MarianTokenizer.from_pretrained(MODEL_PATH3)
model = MarianMTModel.from_pretrained(MODEL_PATH3)
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))

The result is not ['2, 4, 6 등은 짝수입니다.', '그래'] as in the model card example, but ['그들은,우리는,우리는 모자입니다. 신뢰할 수 있습니다.', 'ATP입니다.'], which does not make sense at all.

I tried a few more sentences and believe that the correct tokenizer or vocab file would fix this problem.
Could you take a look at it?
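The symptom is consistent with a vocab mismatch: the model emits sensible token IDs, but the tokenizer decodes them through the wrong ID-to-token table, producing fluent-looking yet unrelated text. A minimal toy sketch of that failure mode (both vocab tables below are invented for illustration, not the model's real vocab files):

```python
# Two hypothetical ID-to-token tables. Decoding the SAME token IDs
# through a mismatched table yields unrelated output, which is the
# behavior reported above.
correct_vocab = {0: "2,", 1: "4,", 2: "6", 3: "etc.", 4: "are", 5: "even", 6: "numbers."}
wrong_vocab   = {0: "그들은,", 1: "우리는,", 2: "우리는", 3: "모자", 4: "입니다.", 5: "신뢰할", 6: "수"}

token_ids = [0, 1, 2, 3, 4, 5, 6]  # what the model (correctly) generated

def decode(vocab, ids):
    # Join the tokens each ID maps to in the given vocab table.
    return " ".join(vocab[i] for i in ids)

print(decode(correct_vocab, token_ids))  # "2, 4, 6 etc. are even numbers."
print(decode(wrong_vocab, token_ids))    # garbled Korean, as in the bug report
```

If swapping in the tokenizer/vocab files from the original OPUS-MT release restores sensible output, that would confirm the files shipped with the Hub checkpoint are the problem.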

@regpath regpath changed the title Wrong tokenizer for the 'Helsinki-NLP/opus-mt-tc-big-en-ko' model Wrong tokenizer/vocab for the 'Helsinki-NLP/opus-mt-tc-big-en-ko' model Sep 13, 2022