
Ja model improvement #410

Merged
merged 2 commits into from
Dec 13, 2023

Conversation

tushuhei
Member

Fixes #387 #220 #216 #157

This new Japanese model addresses several quality issues by using a "weighted samples" approach that gives the fine-tuning data more emphasis during training. It relies on recent updates to the training script (including those in #358 and #408) and was generated with the following commands:

curl -o knbc.tar.bz2 https://nlp.ist.i.kyoto-u.ac.jp/kuntt/KNBC_v1.0_090925_utf8.tar.bz2
tar -xf knbc.tar.bz2  # this generates the KNBC_v1.0_090925_utf8 directory.
python budoux/scripts/prepare_knbc.py KNBC_v1.0_090925_utf8 -o source_knbc.txt
shuf --random-source=source_knbc.txt source_knbc.txt | split -l $(( $(wc -l < source_knbc.txt) * 90 / 100 ))  # this writes 90% of the lines to xaa and the rest to xab.
python budoux/scripts/encode_data.py budoux/data/finetuning/ja/train.txt -o train_finetune.txt --scale=100
python budoux/scripts/encode_data.py xaa -o train_knbc.txt
cat train_knbc.txt train_finetune.txt > train.txt
python budoux/scripts/encode_data.py xab -o val.txt
python budoux/scripts/train.py train.txt --iter=150000 --val-data=val.txt --output=weights.txt --scale=1
python budoux/scripts/build_model.py weights.txt -o model.json
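
To illustrate why weighting the fine-tuning samples matters, here is a minimal sketch. It assumes that `--scale=100` acts as a per-sample weight multiplier, which this description does not spell out. The `weighted_error` helper and the toy data are hypothetical and are not taken from the BudouX scripts.

```python
from typing import List, Tuple

# Hypothetical illustration only: each sample is (label, prediction, weight).
Sample = Tuple[int, int, float]


def weighted_error(samples: List[Sample]) -> float:
    """Returns the share of total weight carried by misclassified samples."""
    total = sum(w for _, _, w in samples)
    wrong = sum(w for y, p, w in samples if y != p)
    return wrong / total


# Nine corpus samples (weight 1, all predicted correctly) plus one
# fine-tuning sample that the current model gets wrong.
corpus = [(1, 1, 1.0)] * 9
unweighted = weighted_error(corpus + [(1, -1, 1.0)])   # fine-tuning weight 1
weighted = weighted_error(corpus + [(1, -1, 100.0)])   # assumed weight 100

print(f"unweighted error: {unweighted:.3f}")
print(f"weighted error:   {weighted:.3f}")
```

Under this assumption, a single misclassified fine-tuning sample dominates the weighted error, so a trainer that minimizes weighted error would be pushed to fix it before the many corpus samples.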

@tushuhei tushuhei requested a review from kojiishi December 13, 2023 06:51
Collaborator

@kojiishi kojiishi left a comment


LGTM

Labels
quality Model quality improvements
Development

Successfully merging this pull request may close these issues.

[quality] "のみ"
2 participants