To automatically convert the data, train a SentencePiece tokenizer, and merge it with the original tokenizer, you can run the following script:

```bash
bash scripts/vocab_extension/train_merge_tokenizer.sh
```
Alternatively, you can run each of the three steps separately:
To convert JSON data to TXT for SentencePiece tokenizer training, run:

```bash
bash scripts/vocab_extension/convert_json_to_txt.sh
```
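If you want to see roughly what this step amounts to, the sketch below dumps the text of each record into a plain-text file with one line per record. It assumes the input JSON is a list of records with a `"text"` field; the field name and file paths here are illustrative and may differ from the script's actual arguments.

```python
import json

def convert_json_to_txt(json_path: str, txt_path: str, text_field: str = "text") -> None:
    """Write the text field of every record to a plain-text file, one line per record."""
    with open(json_path, "r", encoding="utf-8") as f:
        records = json.load(f)
    with open(txt_path, "w", encoding="utf-8") as f:
        for record in records:
            # Flatten newlines so each record stays on a single line.
            line = record[text_field].replace("\n", " ").strip()
            if line:
                f.write(line + "\n")

if __name__ == "__main__":
    # Hypothetical paths for illustration only.
    convert_json_to_txt("data/train.json", "data/train.txt")
```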
To train a SentencePiece tokenizer, run:

```bash
bash scripts/vocab_extension/train_tokenizer.sh
```
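Under the hood, training boils down to a single call to the SentencePiece trainer. The snippet below is a minimal sketch using the `sentencepiece` Python API; the vocabulary size, model type, and paths are illustrative defaults, not the values used by `train_tokenizer.sh`.

```python
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="data/train.txt",        # plain-text corpus produced in the previous step
    model_prefix="new_tokenizer",  # writes new_tokenizer.model and new_tokenizer.vocab
    vocab_size=20000,              # illustrative value; tune to your corpus
    model_type="bpe",
    character_coverage=0.9995,
)
```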
To merge the new tokenizer with the original one, run:

```bash
bash scripts/vocab_extension/merge_tokenizer.sh
```
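As a rough illustration of what merging involves, the sketch below appends pieces from the newly trained SentencePiece model that are missing from the original model's proto, then writes out a combined model. This assumes both tokenizers are SentencePiece models; the file names and the zero score assigned to added pieces are assumptions, and the repository script may handle this differently.

```python
import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_pb2

def merge_tokenizers(original_model: str, new_model: str, output_model: str) -> None:
    """Append pieces from new_model that are absent from original_model, then save."""
    orig_sp = spm.SentencePieceProcessor(model_file=original_model)
    new_sp = spm.SentencePieceProcessor(model_file=new_model)

    orig_proto = sp_pb2.ModelProto()
    orig_proto.ParseFromString(orig_sp.serialized_model_proto())
    new_proto = sp_pb2.ModelProto()
    new_proto.ParseFromString(new_sp.serialized_model_proto())

    existing = {p.piece for p in orig_proto.pieces}
    for piece in new_proto.pieces:
        if piece.piece not in existing:
            added = sp_pb2.ModelProto().SentencePiece()
            added.piece = piece.piece
            added.score = 0.0  # assumed placeholder score for newly added pieces
            orig_proto.pieces.append(added)

    with open(output_model, "wb") as f:
        f.write(orig_proto.SerializeToString())

if __name__ == "__main__":
    # Hypothetical file names for illustration only.
    merge_tokenizers("original.model", "new_tokenizer.model", "merged_tokenizer.model")
```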