How to reuse Sentencepiece tokenizer from subword ASR training into TransformerLM training? #2746
-
Hi, I am trying to train a TransformerLM for ASR rescoring. I suppose I need to reuse the SentencePiece BPE tokenizer I used for fine-tuning the Citrinet subword model. For that, I have added the tokenizer config to my LM training config like this -
But I am getting this error -
I also see that `special_tokens` is set to `None` there.
What should I do here? Thanks in advance.
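For reference, a quick way to sanity-check the SentencePiece model produced for the Citrinet tokenizer is to load it directly with the `sentencepiece` package; the path below is a placeholder, not taken from my actual config:

```python
# Minimal sketch: load the existing SentencePiece BPE model and inspect it.
# "tokenizer_dir/tokenizer.model" is a placeholder path, not from the original post.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer_dir/tokenizer.model")
print("vocab size:", sp.get_piece_size())      # e.g. 128 or 1024 for ASR subword tokenizers
print(sp.encode("hello world", out_type=str))  # subword pieces for a sample sentence
```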
Replies: 1 comment 1 reply
-
The tokenizer for the neural rescorer does not need to be the same as the one for the ASR model. In fact, since some ASR models use small vocab sizes like 128, it is better to use a separate tokenizer for the Transformer with a larger vocab size, e.g. 4k, such as a YTTM (YouTokenToMe) tokenizer.
You may find more info on Transformer LM here:
https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/nlp/language_modeling.html
@AlexGrinch, would you please take a look at this issue?
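To make the suggestion above concrete, here is a minimal sketch of training a separate YouTokenToMe BPE tokenizer with a larger vocabulary for the rescorer; the file paths and vocab size are illustrative, not taken from the NeMo examples:

```python
# Hedged sketch: train a separate BPE tokenizer for the Transformer rescorer
# with a larger vocabulary, as suggested above. Paths are placeholders.
import youtokentome as yttm

yttm.BPE.train(
    data="lm_train_text.txt",          # plain-text corpus used for LM training
    model="rescorer_tokenizer.model",  # where the trained BPE model is written
    vocab_size=4096,                   # larger vocab than the small ASR tokenizer
)

bpe = yttm.BPE(model="rescorer_tokenizer.model")
print(bpe.encode(["hello world"], output_type=yttm.OutputType.SUBWORD))
```

The resulting model file can then be pointed to from the tokenizer section of the Transformer LM training config.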