This repository is dedicated to the "reborn-uasr" project, an initiative that enhances Unsupervised Automatic Speech Recognition (ASR) by training the segmenter with Reinforcement Learning (RL).
The simplest way to access the REBORN models is through Hugging Face. We have wrapped our models, including the PCA dimension-reduction matrix, the REBORN segmenter, and the REBORN generator, into Hugging Face-supported form. We have also uploaded the corresponding datasets to Hugging Face (LibriSpeech 100 hours, Multilingual LibriSpeech across 6 languages). For a quick start, please check out our demo on Google Colab.
To replicate the REBORN end-to-end unsupervised phoneme recognition result, one would need:
- The upstream model (wav2vec 2.0) as the feature extractor.
- The REBORN model (including the PCA dimension reduction matrix, the segmenter, and the generator).
- The corresponding dataset.
Since all of the components are available on Hugging Face, users can follow our demo on Google Colab and reproduce the results across different datasets by simply replacing the model and dataset card names. For convenience, we summarize all the available pairings of card names below:
| Description | upstream_model_card | reborn_model_card | dataset_card | dataset_name | split |
|---|---|---|---|---|---|
| LibriSpeech 100 hour @ iter2-stage1 | facebook/wav2vec2-large-lv60 | andybi7676/reborn-uasr_ls100h_iter2-stage1 | andybi7676/reborn-uasr_librispeech-no-silence-100hr | | {train.clean.100, dev.clean, dev.other, test.clean, test.other, dev.clean.small} |
| LibriSpeech 100 hour @ iter5-stage1 | facebook/wav2vec2-large-lv60 | andybi7676/reborn-uasr_ls100h_iter5-stage1 | andybi7676/reborn-uasr_librispeech-no-silence-100hr | | {train.clean.100, dev.clean, dev.other, test.clean, test.other, dev.clean.small} |
| Multilingual LibriSpeech 100 hour German @ iter2-stage1 | facebook/wav2vec2-large-xlsr-53 | andybi7676/reborn-uasr_mls-de_iter2-stage1 | andybi7676/reborn-uasr_multilingual-librispeech-no-silence-100hr | german | {train.100hr, dev, test, dev.small} |
| Multilingual LibriSpeech 100 hour Dutch @ iter2-stage1 | facebook/wav2vec2-large-xlsr-53 | andybi7676/reborn-uasr_mls-nl_iter2-stage1 | andybi7676/reborn-uasr_multilingual-librispeech-no-silence-100hr | dutch | {train.100hr, dev, test, dev.small} |
| Multilingual LibriSpeech 100 hour French @ iter2-stage1 | facebook/wav2vec2-large-xlsr-53 | andybi7676/reborn-uasr_mls-fr_iter2-stage1 | andybi7676/reborn-uasr_multilingual-librispeech-no-silence-100hr | french | {train.100hr, dev, test, dev.small} |
| Multilingual LibriSpeech 100 hour Spanish @ iter2-stage1 | facebook/wav2vec2-large-xlsr-53 | andybi7676/reborn-uasr_mls-es_iter2-stage1 | andybi7676/reborn-uasr_multilingual-librispeech-no-silence-100hr | spanish | {train.100hr, dev, test, dev.small} |
| Multilingual LibriSpeech 100 hour Italian @ iter2-stage1 | facebook/wav2vec2-large-xlsr-53 | andybi7676/reborn-uasr_mls-it_iter2-stage1 | andybi7676/reborn-uasr_multilingual-librispeech-no-silence-100hr | italian | {train.100hr, dev, test, dev.small} |
| Multilingual LibriSpeech 100 hour Portuguese @ iter2-stage1 | facebook/wav2vec2-large-xlsr-53 | andybi7676/reborn-uasr_mls-pt_iter2-stage1 | andybi7676/reborn-uasr_multilingual-librispeech-no-silence-100hr | portuguese | {train.100hr, dev, test, dev.small} |
By replacing the card names, users can directly try out our pre-trained REBORN models with little effort.
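For reference, a quick-start sketch for one pairing from the table might look like the following. The exact loading interface is defined by the custom model code on the Hugging Face Hub, so treat this as an assumption-laden outline (we assume the standard `trust_remote_code` custom-code pattern) and refer to the Colab demo for the authoritative usage:

```python
# Sketch: loading the LibriSpeech 100h iter2-stage1 pairing from the table above.
# Assumption: the REBORN checkpoints follow the standard Hub custom-code pattern.
from datasets import load_dataset
from transformers import AutoModel, Wav2Vec2Model

upstream_model_card = "facebook/wav2vec2-large-lv60"
reborn_model_card = "andybi7676/reborn-uasr_ls100h_iter2-stage1"
dataset_card = "andybi7676/reborn-uasr_librispeech-no-silence-100hr"

upstream = Wav2Vec2Model.from_pretrained(upstream_model_card)  # feature extractor
reborn = AutoModel.from_pretrained(reborn_model_card, trust_remote_code=True)
data = load_dataset(dataset_card, split="dev.clean.small")     # small split for a quick test
```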
If you want to set up the environment and train the REBORN model on your own, please follow the instructions below to satisfy the requirements.
We provide a pre-built Docker image on Docker Hub. The image contains all the dependencies for training REBORN, and it is probably the simplest way to set up the whole environment if you are familiar with Docker. Run the following command to pull the image and start a container:
```bash
docker run -it --rm --gpus all andybi7676/reborn-uasr:latest
```
Note that this is just an example of using the image in interactive mode with all the GPUs on your machine. Feel free to use it in your own way. If the GPUs are not available inside the container, please verify that nvidia-docker (the NVIDIA Container Toolkit) is installed.
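Once inside the container, a quick check like the one below (assuming the image ships with PyTorch, which the training code requires) confirms that the GPUs are visible:

```python
# Run inside the container: check that PyTorch can see the GPUs.
import torch

print(torch.cuda.is_available(), torch.cuda.device_count())
```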
In this section, we give step-by-step instructions for building the REBORN environment. If you are using the reborn-uasr Docker image, you can skip this section.
We have attached the fairseq version we use in the folder `reborn-uasr/fairseq`. Installing it from our repo ensures there are no version mismatches that could lead to unexpected errors:
```bash
git clone https://github.com/andybi7676/reborn-uasr.git
cd reborn-uasr/fairseq
pip install -e .
```
Please follow the instructions in the official kenlm repo. Make sure that the Python bindings are also installed (`pip install https://github.com/kpu/kenlm/archive/master.zip`).
```bash
cd /your/path/to/reborn-uasr
pip install -r requirements.txt
```
Modify and run `path.sh` to export fairseq and reborn-uasr to `PYTHONPATH`:
- Modify `/path/to/fairseq` to export the correct fairseq path into the environment.
- Run `source path.sh` to append `fairseq` and `reborn-uasr` to `PYTHONPATH`. The result should look like the following:

```
(base) username@desktop:/your/path/to/reborn-uasr$ source path.sh
Added /your/path/to/fairseq to PYTHONPATH
Appended /your/path/to/reborn-uasr to PYTHONPATH
=======================================================================================
FAIRSEQ_ROOT: /your/path/to/fairseq
REBORN_WORK_DIR: /your/path/to/reborn-uasr
PYTHONPATH: /your/path/to/fairseq:/your/path/to/reborn-uasr
Please make sure that FAIRSEQ_ROOT and REBORN_WORK_DIR are in PYTHONPATH
During each runtime, please make sure to run `source path.sh` to set up the environment.
=======================================================================================
Testing the required import functionality...
SUCCESS
```
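As an extra sanity check after sourcing the script, you can confirm from Python that the paths are visible (a minimal sketch mirroring the script's own import test, assuming `path.sh` exports `FAIRSEQ_ROOT` and `REBORN_WORK_DIR` as environment variables):

```python
# Verify that path.sh exposed FAIRSEQ_ROOT and REBORN_WORK_DIR to Python.
import os
import sys

for var in ("FAIRSEQ_ROOT", "REBORN_WORK_DIR"):
    root = os.environ.get(var)
    print(f"{var}={root} | on sys.path: {root in sys.path}")

import fairseq  # should import without error after `source path.sh`
print("SUCCESS")
```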
TBA
TBA
In this section, we introduce how to train your own REBORN model from scratch. Before diving into training, we recommend going through the Prerequisites section and making sure that all the requirements are satisfied.
We divide the training process into the following three main stages: wav2vec-U initialization, segmenter training, and generator (phoneme prediction model) training.
In this step, we initialize the CNN-based segmenter via behavior cloning: the segmenter is trained on pseudo-boundaries derived from a wav2vec-U model. This pretraining provides a solid starting point before we move on to reinforcement learning.
```bash
bash rl/cnn_segmenter/_pretrain.sh
```
Expected Output: `cnn_segmenter.pt` in the specified `output_dir`.
Important Arguments to Adjust:
- `reborn_dir`: Root directory of the `reborn-uasr` codebase.
- `output_dir`: Directory where pretraining results and checkpoints will be stored.
- `audio_dir`: Directory containing features and boundary files. Example structure:

```
audio_dir
├── CLUS128
│   ├── train.bds
│   └── valid.bds
├── precompute_pca512
├── train.npy
└── valid.npy
```
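For intuition, behavior cloning here amounts to frame-level supervised training of the CNN segmenter on the wav2vec-U pseudo-boundaries. The following is only a conceptual sketch, not the repo's actual training loop; the architecture, shapes, and hyperparameters are illustrative:

```python
import torch
import torch.nn as nn

class CnnSegmenter(nn.Module):
    """Toy stand-in for the CNN boundary predictor."""
    def __init__(self, feat_dim=512, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(feat_dim, hidden, kernel_size=7, padding=3),
            nn.ReLU(),
            nn.Conv1d(hidden, 2, kernel_size=1),  # boundary vs. non-boundary logits
        )

    def forward(self, feats):                     # feats: (batch, time, feat_dim)
        return self.net(feats.transpose(1, 2)).transpose(1, 2)

segmenter = CnnSegmenter()
feats = torch.randn(4, 100, 512)                  # PCA-reduced wav2vec 2.0 features
pseudo = torch.randint(0, 2, (4, 100))            # wav2vec-U pseudo-boundary labels
logits = segmenter(feats)
loss = nn.functional.cross_entropy(logits.reshape(-1, 2), pseudo.reshape(-1))
loss.backward()                                   # one behavior-cloning step
```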
After pretraining, we refine the segmenter using reinforcement learning. The RL step optimizes the segmenter by considering language model perplexity, phoneme-level token error rates, and length ratio constraints, thereby improving segmentation quality.
```bash
bash rl/cnn_segmenter/_train.sh
```
Expected Output: Multiple RL-updated checkpoints, for example `rl_agent_segmenter_best.pt`.
Important Arguments to Adjust:
- `reborn_dir`, `output_dir`: As in pretraining, ensure these are set correctly.
- `audio_dir`: Move the wav2vec-U logit-segmented phoneme results to:

```
audio_dir
├── precompute_pca512
│   ├── train.npy
│   ├── train.w2vu_logit_segmented_units
│   ├── valid.npy
│   └── valid.w2vu_logit_segmented_units
```

- `kenlm_fpath`: Path to the KenLM language model.
- `Pretrain_segmenter_path`: Path to the pretrained segmenter checkpoint from the Behavior Cloning step.
- `Pretrain_wav2vecu_path`: Path to the wav2vec-U checkpoint used for feature extraction/logit generation.
- Adjust `coef_ter`, `coef_len`, and `lr` in `_train.sh` to tune performance.
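To make the reward terms concrete: the RL objective combines the three signals mentioned above (language model perplexity, phoneme-level token error rate, and a length-ratio constraint). The exact formulation lives in the training scripts; the function below is only an illustrative combination showing how weights like `coef_ter` and `coef_len` trade the terms off:

```python
# Illustrative reward: lower perplexity/TER and a length ratio near the target
# are rewarded. This is a sketch, not the exact equation used by the repo.
def segment_reward(ppl: float, ter: float, len_ratio: float,
                   coef_ter: float = 0.2, coef_len: float = 0.2,
                   target_ratio: float = 1.0) -> float:
    return -ppl - coef_ter * ter - coef_len * abs(len_ratio - target_ratio)

print(segment_reward(ppl=35.2, ter=0.18, len_ratio=0.95))
```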
Use the `rl/utils/_evaluate.sh` script to evaluate your trained segmenter on the development and test splits. This script generates phoneme sequences and compares them against ground-truth references.
Key Arguments:
- `reborn_dir`, `output_dir`: Ensure these match your setup.
- `generator_ckpt`: Path to the wav2vec-U generator model checkpoint.
- `feats_dir`: Directory containing the PCA-reduced features used during evaluation.
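If you want to spot-check the evaluation output yourself, the phoneme error rate can be computed from a hypothesis file and a reference file with one utterance per line. A small helper, assuming the `editdistance` package and hypothetical file names:

```python
# Compute phoneme error rate (PER) between hypothesis and reference transcripts.
import editdistance  # pip install editdistance

def phoneme_error_rate(hyp_path: str, ref_path: str) -> float:
    errors, total = 0, 0
    with open(hyp_path) as hyps, open(ref_path) as refs:
        for hyp, ref in zip(hyps, refs):
            hyp_tokens, ref_tokens = hyp.split(), ref.split()
            errors += editdistance.eval(hyp_tokens, ref_tokens)
            total += len(ref_tokens)
    return errors / total

print(phoneme_error_rate("dev.hyp.phn", "dev.ref.phn"))  # hypothetical file names
```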
Please cite this work as:
```bibtex
@article{tseng2024reborn,
  title={REBORN: Reinforcement-Learned Boundary Segmentation with Iterative Training for Unsupervised ASR},
  author={Tseng, Liang-Hsuan and Hu, En-Pei and Chiang, Cheng-Han and Tseng, Yuan and Lee, Hung-yi and Lee, Lin-shan and Sun, Shao-Hua},
  journal={arXiv preprint arXiv:2402.03988},
  year={2024}
}
```