How to reuse Sentencepiece tokenizer from subword ASR training into TransformerLM training? #2746
-
Hi, I am trying to train a TransformerLM for ASR rescoring. I suppose I need to reuse the SentencePiece BPE tokenizer I used for fine-tuning the Citrinet subword model. For that, I have added the tokenizer config to my LM training config like this -
But I am getting this error -
I also see that `special_tokens` is set to `None` there.
What should I do here? Thanks in advance.
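For reference, a quick way to sanity-check the SentencePiece model produced for the Citrinet tokenizer is to load it directly with the `sentencepiece` package; the path below is a placeholder, not taken from my actual config:

```python
# Minimal sketch: load the existing SentencePiece BPE model and inspect it.
# "tokenizer_dir/tokenizer.model" is a placeholder path, not from the original post.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer_dir/tokenizer.model")
print("vocab size:", sp.get_piece_size())      # e.g. 128 or 1024 for ASR subword tokenizers
print(sp.encode("hello world", out_type=str))  # subword pieces for a sample sentence
```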
Replies: 1 comment 1 reply
-
The tokenizer for the neural rescorer does not need to be the same as the one for the ASR model. In fact, since some ASR models use small vocab sizes like 128, it is better to use a separate tokenizer for the Transformer with a larger vocab size, e.g. 4k, such as a YTTM (YouTokenToMe) tokenizer.
You may find more info on Transformer LM here:
https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/nlp/language_modeling.html
@AlexGrinch, would you please take a look at this issue?
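To make the suggestion above concrete, here is a minimal sketch of training a separate YouTokenToMe BPE tokenizer with a larger vocabulary for the rescorer; the file paths and vocab size are illustrative, not taken from the NeMo examples:

```python
# Hedged sketch: train a separate BPE tokenizer for the Transformer rescorer
# with a larger vocabulary, as suggested above. Paths are placeholders.
import youtokentome as yttm

yttm.BPE.train(
    data="lm_train_text.txt",          # plain-text corpus used for LM training
    model="rescorer_tokenizer.model",  # where the trained BPE model is written
    vocab_size=4096,                   # larger vocab than the small ASR tokenizer
)

bpe = yttm.BPE(model="rescorer_tokenizer.model")
print(bpe.encode(["hello world"], output_type=yttm.OutputType.SUBWORD))
```

The resulting model file can then be pointed to from the tokenizer section of the Transformer LM training config.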