To automatically convert the data, train a SentencePiece tokenizer, and merge it with the original tokenizer, you can run the following script:

```bash
bash scripts/vocab_extension/train_merge_tokenizer.sh
```
Alternatively, you can run each of the three steps separately:
To convert JSON data to TXT for SentencePiece tokenizer training, run:

```bash
bash scripts/vocab_extension/convert_json_to_txt.sh
```
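If you want to see roughly what this step amounts to, the sketch below dumps the text of each record into a plain-text file with one line per record. It assumes the input JSON is a list of records with a `"text"` field; the field name and file paths here are illustrative and may differ from the script's actual arguments.

```python
import json

def convert_json_to_txt(json_path: str, txt_path: str, text_field: str = "text") -> None:
    """Write the text field of every record to a plain-text file, one line per record."""
    with open(json_path, "r", encoding="utf-8") as f:
        records = json.load(f)
    with open(txt_path, "w", encoding="utf-8") as f:
        for record in records:
            # Flatten newlines so each record stays on a single line.
            line = record[text_field].replace("\n", " ").strip()
            if line:
                f.write(line + "\n")

if __name__ == "__main__":
    # Hypothetical paths for illustration only.
    convert_json_to_txt("data/train.json", "data/train.txt")
```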
To train a SentencePiece tokenizer, run:

```bash
bash scripts/vocab_extension/train_tokenizer.sh
```
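Under the hood, training boils down to a single call to the SentencePiece trainer. The snippet below is a minimal sketch using the `sentencepiece` Python API; the vocabulary size, model type, and paths are illustrative defaults, not the values used by `train_tokenizer.sh`.

```python
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="data/train.txt",        # plain-text corpus produced in the previous step
    model_prefix="new_tokenizer",  # writes new_tokenizer.model and new_tokenizer.vocab
    vocab_size=20000,              # illustrative value; tune to your corpus
    model_type="bpe",
    character_coverage=0.9995,
)
```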
To merge the new tokenizer with the original one, run:

```bash
bash scripts/vocab_extension/merge_tokenizer.sh
```
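As a rough illustration of what merging involves, the sketch below appends pieces from the newly trained SentencePiece model that are missing from the original model's proto, then writes out a combined model. This assumes both tokenizers are SentencePiece models; the file names and the zero score assigned to added pieces are assumptions, and the repository script may handle this differently.

```python
import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_pb2

def merge_tokenizers(original_model: str, new_model: str, output_model: str) -> None:
    """Append pieces from new_model that are absent from original_model, then save."""
    orig_sp = spm.SentencePieceProcessor(model_file=original_model)
    new_sp = spm.SentencePieceProcessor(model_file=new_model)

    orig_proto = sp_pb2.ModelProto()
    orig_proto.ParseFromString(orig_sp.serialized_model_proto())
    new_proto = sp_pb2.ModelProto()
    new_proto.ParseFromString(new_sp.serialized_model_proto())

    existing = {p.piece for p in orig_proto.pieces}
    for piece in new_proto.pieces:
        if piece.piece not in existing:
            added = sp_pb2.ModelProto().SentencePiece()
            added.piece = piece.piece
            added.score = 0.0  # assumed placeholder score for newly added pieces
            orig_proto.pieces.append(added)

    with open(output_model, "wb") as f:
        f.write(orig_proto.SerializeToString())

if __name__ == "__main__":
    # Hypothetical file names for illustration only.
    merge_tokenizers("original.model", "new_tokenizer.model", "merged_tokenizer.model")
```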