Finetuning scripts for M4T

This section walks through an example of finetuning M4T on a single translation direction, English-to-Korean.

The trainer and dataloader were designed mainly for demonstration purposes. They are kept deliberately simple so that the code is easy to read and to port.

Data preparation

The M4T training dataset is a multimodal parallel corpus. Each training sample has four parts: the audio and text representations of the sample in the source language, and the corresponding audio and text representations in the target language.

Such a dataset can be prepared with the dataset.py script, which downloads the FLEURS dataset from the HuggingFace datasets hub, (optionally) extracts units from the target audio samples, and prepares a manifest consumable by finetune.py. The manifest is a text file in which each line describes a single dataset sample, serialized as JSON.
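Because each manifest line is a standalone JSON document, a manifest can be inspected with a few lines of Python. The following is a minimal sketch, not the loader used by finetune.py. The helper name read_manifest is hypothetical, and the sketch makes no assumption about which fields a sample contains.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def read_manifest(path: str) -> List[Dict[str, Any]]:
    """Parse a JSON-lines manifest: one JSON object per non-empty line."""
    samples: List[Dict[str, Any]] = []
    with Path(path).open(encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between samples
            try:
                samples.append(json.loads(line))
            except json.JSONDecodeError as e:
                # Report the offending line so a broken manifest is easy to fix.
                raise ValueError(f"{path}:{line_no}: invalid JSON ({e})") from e
    return samples
```

For example, len(read_manifest(path)) gives the number of samples in a split, and the keys of the first returned dictionary show which fields the manifest actually carries.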

List of input arguments for dataset.py:

  --source_lang SOURCE_LANG
                        M4T langcode of the dataset SOURCE language
  --target_lang TARGET_LANG
                        M4T langcode of the dataset TARGET language
  --split SPLIT         Dataset split/shard to download (`train`, `validation`, `test`)
  --save_dir SAVE_DIR   Directory where the datasets will be stored with HuggingFace datasets cache files

Language codes should follow the notation adopted by M4T models.

Below is an example bash script that prepares training and evaluation datasets for the English-to-Korean translation direction:

export DATASET_DIR=~/m4t_dataset
mkdir -p $DATASET_DIR

m4t_prepare_dataset \
  --source_lang eng \
  --target_lang kor \
  --split train \
  --save_dir $DATASET_DIR
m4t_prepare_dataset \
  --source_lang eng \
  --target_lang kor \
  --split validation \
  --save_dir $DATASET_DIR

Output manifests will be stored in ${DATASET_DIR}/train_manifest.json and ${DATASET_DIR}/validation_manifest.json.

Finetuning

finetune.py is an example finetuning script that initializes the dataloaders and launches a training loop with periodic scoring against the validation dataset. It is recommended to launch it with torchrun. Multi-GPU and multi-node training are supported out of the box.

List of input arguments for finetune.py:

  --train_dataset TRAIN_DATASET
                        Path to manifest with train samples
  --eval_dataset EVAL_DATASET
                        Path to manifest with eval samples
  --model_name MODEL_NAME
                        Base model name (e.g., `seamlessM4T_medium`, `seamlessM4T_large`)
  --save_model_to SAVE_MODEL_TO
                        Path to save best finetuned model
  --seed SEED           Randomizer seed value
  --batch_size BATCH_SIZE
                        Batch size for training and evaluation
  --patience PATIENCE   Set early termination after `patience` number of evaluations without eval loss improvements
  --max_epochs MAX_EPOCHS
                        Max number of training epochs
  --learning_rate LEARNING_RATE
                        Finetuning learning rate
  --warmup_steps WARMUP_STEPS
                        Number of steps with linearly increasing learning rate
  --eval_steps EVAL_STEPS
                        Get eval loss after each `eval_steps` training steps
  --log_steps LOG_STEPS
                        Log inner loss after each `log_steps` training steps
  --mode {FinetuneMode.SPEECH_TO_SPEECH,FinetuneMode.SPEECH_TO_TEXT,FinetuneMode.TEXT_TO_SPEECH}
                        * `SPEECH_TO_SPEECH` -- finetune S2T and T2U parts of the model;
                        * `TEXT_TO_SPEECH` -- finetune only T2U;
                        * `SPEECH_TO_TEXT` -- finetune only S2T
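
The interplay between --eval_steps and --patience can be pictured as a counter that is reset whenever the eval loss improves and decremented otherwise. The sketch below is an illustration under assumptions, not the trainer's actual code: the class name PatienceTracker, its methods, and the eval losses in the loop are all hypothetical.

```python
class PatienceTracker:
    """Track the best eval loss and signal when patience is exhausted."""

    def __init__(self, patience: int) -> None:
        self.patience = patience
        self.steps_left = patience
        self.best_loss = float("inf")

    def update(self, eval_loss: float) -> bool:
        """Record one evaluation; return True if it set a new best loss."""
        if eval_loss < self.best_loss:
            self.best_loss = eval_loss
            self.steps_left = self.patience  # an improvement resets the counter
            return True
        self.steps_left -= 1  # no improvement: one evaluation closer to stopping
        return False

    @property
    def should_stop(self) -> bool:
        return self.steps_left <= 0


tracker = PatienceTracker(patience=3)
for loss in [8.80, 8.78, 8.79, 8.81, 8.80]:  # made-up eval losses
    improved = tracker.update(loss)  # a real loop would save the model when True
    if tracker.should_stop:
        break  # three evaluations in a row without improvement
```

In a training loop, update would be called once every eval_steps updates, and the best checkpoint would be written whenever it returns True.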

The script supports three finetuning modes:

  • SPEECH_TO_SPEECH: all model weights except the text encoder are finetuned;
  • TEXT_TO_SPEECH: only the text-to-unit (T2U) part of the model is finetuned; all other weights are frozen;
  • SPEECH_TO_TEXT: only the speech-to-text (S2T) part of the model is finetuned.

The referenced finetuning script does not support finetuning of the text encoder.
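
The mode descriptions above amount to a lookup from mode to the set of sub-models that receive updates. The following sketch restates that lookup in code. It assumes nothing about the real model classes: the enum here is a stand-in that only mirrors the mode names, and the labels "s2t", "t2u", and "text_encoder" are placeholder strings, not attribute names from the library.

```python
from enum import Enum
from typing import Dict, FrozenSet


class FinetuneMode(Enum):
    """Stand-in enum mirroring the three mode names accepted by --mode."""

    SPEECH_TO_SPEECH = "SPEECH_TO_SPEECH"
    SPEECH_TO_TEXT = "SPEECH_TO_TEXT"
    TEXT_TO_SPEECH = "TEXT_TO_SPEECH"


# Sub-models that are finetuned in each mode (placeholder labels).
TRAINED_PARTS: Dict[FinetuneMode, FrozenSet[str]] = {
    FinetuneMode.SPEECH_TO_SPEECH: frozenset({"s2t", "t2u"}),
    FinetuneMode.SPEECH_TO_TEXT: frozenset({"s2t"}),
    FinetuneMode.TEXT_TO_SPEECH: frozenset({"t2u"}),
}


def is_finetuned(part: str, mode: FinetuneMode) -> bool:
    """Return True if the given sub-model is updated in the given mode."""
    return part in TRAINED_PARTS[mode]
```

Under this mapping, is_finetuned("text_encoder", mode) is False for every mode, which matches the statement that the text encoder is never finetuned.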

Below is an example bash script that launches finetuning of M4T-large on the dataset prepared earlier, using a single node with eight GPUs:

torchrun \
   --rdzv-backend=c10d \
   --rdzv-endpoint=localhost:0 \
   --nnodes=1 \
   --nproc-per-node=8  \
   --no-python \
  m4t_finetune \
   --mode SPEECH_TO_TEXT \
   --train_dataset $DATASET_DIR/train_manifest.json  \
   --eval_dataset $DATASET_DIR/validation_manifest.json \
   --learning_rate 1e-6 \
   --warmup_steps 100 \
   --max_epochs 10 \
   --patience 3 \
   --model_name seamlessM4T_large \
   --save_model_to $DATASET_DIR/checkpoint.pt

Excerpt from an example finetuning log:

...
2023-08-21 14:46:16,936 INFO -- trainer.1100368: Eval after 300 updates: loss=8.7755 best_loss=8.7755 patience_steps_left=3
2023-08-21 14:46:16,936 INFO -- trainer.1100368: Saving model
2023-08-21 14:46:35,863 INFO -- trainer.1100368: Epoch 006 / update 00310: train loss=16.3768 last lr=5.68E-08
2023-08-21 14:46:42,610 INFO -- trainer.1100368: Epoch 006 / update 00320: train loss=16.3730 last lr=5.59E-08
2023-08-21 14:46:48,285 INFO -- trainer.1100368: Epoch 006 / update 00330: train loss=16.4598 last lr=5.50E-08
2023-08-21 14:46:54,390 INFO -- trainer.1100368: Epoch 006 / update 00340: train loss=16.4218 last lr=5.42E-08
2023-08-21 14:47:08,461 INFO -- trainer.1100368: Epoch 006 / update 00350: train loss=16.3906 last lr=5.35E-08
2023-08-21 14:47:09,067 INFO -- trainer.1100368: Run evaluation
2023-08-21 14:47:19,205 INFO -- trainer.1100368: Eval after 350 updates: loss=8.7462 best_loss=8.7462 patience_steps_left=3
2023-08-21 14:47:19,205 INFO -- trainer.1100368: Saving model
2023-08-21 14:47:44,981 INFO -- trainer.1100368: Epoch 007 / update 00360: train loss=16.4267 last lr=5.27E-08
2023-08-21 14:47:51,383 INFO -- trainer.1100368: Epoch 007 / update 00370: train loss=16.3630 last lr=5.20E-08
2023-08-21 14:47:58,305 INFO -- trainer.1100368: Epoch 007 / update 00380: train loss=16.3666 last lr=5.13E-08
2023-08-21 14:48:04,396 INFO -- trainer.1100368: Epoch 007 / update 00390: train loss=16.3605 last lr=5.06E-08
2023-08-21 14:48:10,630 INFO -- trainer.1100368: Epoch 007 / update 00400: train loss=16.3518 last lr=5.00E-08
...
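
The learning rates in the log excerpt shrink roughly in proportion to 1/sqrt(update), which is consistent with a linear warmup followed by inverse-square-root decay. The sketch below assumes that schedule. The argument list only states that --warmup_steps sets the number of steps with a linearly increasing learning rate, so the decay rule, the function name lr_at, and the peak value 1e-7 are assumptions. The peak was picked because, with 100 warmup steps, it reproduces the logged values; it differs from the 1e-6 used in the example command.

```python
import math


def lr_at(step: int, peak_lr: float, warmup_steps: int) -> float:
    """Assumed schedule: linear warmup to peak_lr, then 1/sqrt(step) decay.

    Expects step >= 1 and warmup_steps >= 1.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * math.sqrt(warmup_steps / step)


for step in (50, 100, 310, 400):
    lr = lr_at(step, peak_lr=1e-7, warmup_steps=100)
    print(f"update {step:05d}: lr={lr:.2E}")
```

With these assumed settings, updates 310 and 400 print lr=5.68E-08 and lr=5.00E-08, the same values that appear in the excerpt.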