How to convert models with two vocab files to PyTorch? #22
This model uses two separate vocabularies and does not properly convert to PyTorch / Hugging Face at the moment. Hopefully this will be added to the conversion procedures soon.
Thanks @jorgtied!
The latest conversion scripts in the Transformers library support the conversion of models with two vocabs. You may also check my recipes in https://github.com/Helsinki-NLP/Opus-MT/tree/master/hf
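For reference, a hypothetical invocation of the Tatoeba conversion script mentioned later in this thread. The flag names below are assumptions based on recent Transformers versions; check the script's `--help` for your installed version before running it.

```shell
# Convert a Tatoeba-Challenge Marian model (here: eng-kor) to PyTorch.
# --models and --save_dir are assumed flag names; verify with --help.
python src/transformers/models/marian/convert_marian_tatoeba_to_pytorch.py \
    --models eng-kor \
    --save_dir ./converted
```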
I removed that model because it was so poor (at least according to the scores). I should create new ones for this language pair.
Hi, I still get the same error. I used the script from Transformers, and I also tried the convert_to_pytorch.py script you suggested; same error. Can you show me the command to convert such a two-vocab model to PyTorch? Thanks
More resources on these split-vocab models would be helpful. I'm also trying to compile these to CTranslate2 and having difficulties due to the split vocabs.
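For the CTranslate2 side, the project ships an OPUS-MT converter that reads a Marian model directory directly. A hedged sketch, assuming the standard entry point and flag names (paths are placeholders):

```shell
# Convert an OPUS-MT / Marian model directory to CTranslate2 format.
# --model_dir / --output_dir are the expected flags; confirm with --help.
ct2-opus-mt-converter \
    --model_dir ./opus-model-eng-kor \
    --output_dir ./ct2-eng-kor
```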
Hi,

I would like to get translation results from the eng-kor model with `transformers.MarianMTModel` and `transformers.MarianTokenizer`. I understand we need to first convert the model to PyTorch format with convert_marian_tatoeba_to_pytorch.py. The eng-kor model has two different vocab sets for the encoder and decoder. How can we use the `transformers.models.marian.convert_marian_to_pytorch.convert` function to do the conversion? Because there is no vocab.yml file in the zip file, I found that line 381 throws an `IndexError: list index out of range` error.

Thanks
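To make the failure mode concrete, here is a minimal stdlib-only sketch of the split-vocab layout described above. The file names (`source.vocab.yml` / `target.vocab.yml`) and the toy parser are assumptions for illustration, not the release's actual format; real code would use PyYAML.

```python
# Sketch: a release with separate encoder/decoder vocabularies, and why a
# converter that expects a single shared vocab.yml finds nothing to index.
import os
import tempfile

workdir = tempfile.mkdtemp()

# Toy stand-ins for the two SentencePiece vocabularies (token -> id).
vocabs = {
    "source.vocab.yml": {"<unk>": 0, "</s>": 1, "hello": 2, "world": 3},
    "target.vocab.yml": {"<unk>": 0, "</s>": 1, "annyeong": 2, "segye": 3},
}
for name, vocab in vocabs.items():
    with open(os.path.join(workdir, name), "w", encoding="utf-8") as f:
        f.writelines(f"{tok}: {idx}\n" for tok, idx in vocab.items())

def load_vocab(path):
    """Parse a flat 'token: id' mapping (toy parser; real code uses PyYAML)."""
    out = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            tok, _, idx = line.rstrip("\n").rpartition(": ")
            out[tok] = int(idx)
    return out

src = load_vocab(os.path.join(workdir, "source.vocab.yml"))
tgt = load_vocab(os.path.join(workdir, "target.vocab.yml"))

# No shared vocab.yml exists, consistent with the reported IndexError when
# the converter indexes into an empty list of matched vocab files.
shared = [p for p in os.listdir(workdir) if p == "vocab.yml"]
print(len(shared), set(src) == set(tgt))  # -> 0 False
```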