This section demonstrates an example of M4T finetuning on a single translation direction: English-to-Korean.
The trainer and dataloader were designed mainly for demonstration purposes. Their simplicity should facilitate the code transparency and portability.
M4T training dataset is a multimodal parallel corpus. Each training sample has four parts: audio and text representation of the sample in the source language, and its corresponding audio and text representation in the target language.
That kind of dataset can be prepared using dataset.py
script that downloads FLEURS dataset from HuggingFace datasets hub, (optionally) extracts units from the target audio samples, and prepares a manifest consumable by finetune.py
. Manifest is a text file where each line represents information about a single dataset sample, serialized in JSON format.
List of input arguments for dataset.py
:
--source_lang SOURCE_LANG
M4T langcode of the dataset SOURCE language
--target_lang TARGET_LANG
M4T langcode of the dataset TARGET language
--split SPLIT Dataset split/shard to download (`train`, `test`)
--save_dir SAVE_DIR Directory where the datasets will be stored with HuggingFace datasets cache files
Language codes should follow the notation adopted by M4T models.
Below is an example bash script that prepares a training and evaluation dataset for the translation direction English-to-Korean:
export DATASET_DIR=~/m4t_dataset
mkdir -p $DATASET_DIR
m4t_prepare_dataset \
--source_lang eng \
--target_lang kor \
--split train \
--save_dir $DATASET_DIR
m4t_prepare_dataset \
--source_lang eng \
--target_lang kor \
--split validation \
--save_dir $DATASET_DIR
Output manifests will be stored in ${DATASET_DIR}/train_manifest.json
and ${DATASET_DIR}/validation_manifest.json
.
finetune.py
is an example finetuning script that initializes dataloaders, and launches training loop with periodic scoring against the validation dataset.
It is recommended to launch it with torchrun
. Multi-gpu and multi-node training are supported out of the box.
List of input arguments for finetune.py
:
--train_dataset TRAIN_DATASET
Path to manifest with train samples
--eval_dataset EVAL_DATASET
Path to manifest with eval samples
--model_name MODEL_NAME
Base model name (e.g, `seamlessM4T_medium`, `seamlessM4T_large`)
--save_model_to SAVE_MODEL_TO
Path to save best finetuned model
--seed SEED Randomizer seed value
--batch_size BATCH_SIZE
Batch size for training and evaluation
--patience PATIENCE Set early termination after `patience` number of evaluations without eval loss improvements
--max_epochs MAX_EPOCHS
Max number of training epochs
--learning_rate LEARNING_RATE
Finetuning learning rate
--warmup_steps WARMUP_STEPS
Number of steps with linearly increasing learning rate
--eval_steps EVAL_STEPS
Get eval loss after each `eval_steps` training steps
--log_steps LOG_STEPS
Log inner loss after each `log_steps` training steps
--mode {FinetuneMode.SPEECH_TO_SPEECH,FinetuneMode.SPEECH_TO_TEXT,FinetuneMode.TEXT_TO_SPEECH}
* `SPEECH_TO_SPEECH` -- finetune S2T and T2U parts of the model;
* `TEXT_TO_SPEECH` -- finetune only T2U;
* `SPEECH_TO_TEXT` -- finetune only S2T
The scripts supports three modes of finetuning:
SPEECH_TO_SPEECH
: in this case all model weights except the text encoder will be engaged;TEXT_TO_SPEECH
: only text-to-unit part of the model will be engaged in the finetuning, other weights will be frozen;SPEECH_TO_TEXT
: only speech-to-text part of the model will be engaged in the finetuning.
The referenced finetuning script does not support finetuning of the text encoder.
Below is an example bash script that launches finetuning of M4T-large on the dataset prepared earlier, using a single node with eight GPUs:
torchrun \
--rdzv-backend=c10d \
--rdzv-endpoint=localhost:0 \
--nnodes=1 \
--nproc-per-node=8 \
--no-python \
m4t_finetune \
--mode SPEECH_TO_TEXT \
--train_dataset $DATASET_DIR/train_manifest.json \
--eval_dataset $DATASET_DIR/validation_manifest.json \
--learning_rate 1e-6 \
--warmup_steps 100 \
--max_epochs 10 \
--patience 3 \
--model_name seamlessM4T_large \
--save_model_to $DATASET_DIR/checkpoint.pt
Excerpt from an example finetuning log:
...
2023-08-21 14:46:16,936 INFO -- trainer.1100368: Eval after 300 updates: loss=8.7755 best_loss=8.7755 patience_steps_left=3
2023-08-21 14:46:16,936 INFO -- trainer.1100368: Saving model
2023-08-21 14:46:35,863 INFO -- trainer.1100368: Epoch 006 / update 00310: train loss=16.3768 last lr=5.68E-08
2023-08-21 14:46:42,610 INFO -- trainer.1100368: Epoch 006 / update 00320: train loss=16.3730 last lr=5.59E-08
2023-08-21 14:46:48,285 INFO -- trainer.1100368: Epoch 006 / update 00330: train loss=16.4598 last lr=5.50E-08
2023-08-21 14:46:54,390 INFO -- trainer.1100368: Epoch 006 / update 00340: train loss=16.4218 last lr=5.42E-08
2023-08-21 14:47:08,461 INFO -- trainer.1100368: Epoch 006 / update 00350: train loss=16.3906 last lr=5.35E-08
2023-08-21 14:47:09,067 INFO -- trainer.1100368: Run evaluation
2023-08-21 14:47:19,205 INFO -- trainer.1100368: Eval after 350 updates: loss=8.7462 best_loss=8.7462 patience_steps_left=3
2023-08-21 14:47:19,205 INFO -- trainer.1100368: Saving model
2023-08-21 14:47:44,981 INFO -- trainer.1100368: Epoch 007 / update 00360: train loss=16.4267 last lr=5.27E-08
2023-08-21 14:47:51,383 INFO -- trainer.1100368: Epoch 007 / update 00370: train loss=16.3630 last lr=5.20E-08
2023-08-21 14:47:58,305 INFO -- trainer.1100368: Epoch 007 / update 00380: train loss=16.3666 last lr=5.13E-08
2023-08-21 14:48:04,396 INFO -- trainer.1100368: Epoch 007 / update 00390: train loss=16.3605 last lr=5.06E-08
2023-08-21 14:48:10,630 INFO -- trainer.1100368: Epoch 007 / update 00400: train loss=16.3518 last lr=5.00E-08
...