This repository provides a demo implementation of "Masked Modeling Duo for Speech: Specializing General-Purpose Audio Representation to Speech using Denoising Distillation."
- Code for pre-training and pre-processing LS-960 features.
- Pre-trained weights.
- SUPERB evaluation code and instructions.
For installation, follow the instructions in "1. Getting Started" in the main README.md.
For evaluating on SUPERB, refer to `superb/upstream/m2d/README.md`.
Find the pre-trained weight files in the releases:
- M2D-S T=4.0s: `m2d_s_vit_base-80x400p80x2-230201`
- M2D-S T=5.12s: `m2d_s_vit_base-80x512p80x2-230301`
- M2D-S T=6.08s: `m2d_s_vit_base-80x608p80x2-230220`
Model | Pre-trained dataset | PR | KS | IC | SID | ER | ENV | MUS |
---|---|---|---|---|---|---|---|---|
M2D-S T=4.0s | LS-960+AS | 5.72 | 96.47 | 97.80 | 81.97 | 66.36 | 53.22 | 41.71 |
M2D-S T=5.12s | LS-960+AS | 5.64 | 96.87 | 97.65 | 80.69 | 65.35 | 57.34 | 43.23 |
M2D-S T=6.08s | LS-960+AS | 5.33 | 96.80 | 97.63 | 81.74 | 66.13 | 54.77 | 43.75 |

PR (phoneme recognition) is PER % (lower is better); KS (keyword spotting), IC (intent classification), SID (speaker identification), and ER (emotion recognition) are SUPERB accuracies % (higher is better). ENV and MUS summarize the non-speech environmental-sound and music task results.
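To sanity-check a downloaded weight file, you can inspect it with plain PyTorch. Below is a minimal sketch; the file name is a placeholder for whichever release asset you downloaded, and the `'model'` key reflects the MAE-style checkpoint convention this codebase builds on (an assumption).

```python
import torch

# Placeholder path -- point it at the .pth file downloaded from the releases.
ckpt = torch.load('m2d_s_vit_base-80x400p80x2-230201.pth', map_location='cpu')

# MAE-style training loops typically nest the weights under a 'model' key (assumption).
state_dict = ckpt.get('model', ckpt)

# Print a few parameter names/shapes to confirm the ViT encoder weights are present.
for name, tensor in list(state_dict.items())[:5]:
    print(name, tuple(tensor.shape))
```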
M2D-S learns from the following pre-processed files, created from LibriSpeech (LS-960) and a pre-trained HuBERT-base model.
- `data/ls960_hybrid7s_hubaseL9`: Pre-processed data consisting of log-mel spectrogram samples converted from LS-960 and HuBERT layer #9 features encoded from LS-960.
- `data/files_ls960_hybrid.csv`: List of the pre-processed files in the `ls960_hybrid7s_hubaseL9` folder.
M2D-S also requires AudioSet as background noise.
- `data/audioset_lms`: Pre-processed log-mel spectrogram samples from AudioSet, as in the original M2D.
- `data/files_audioset.csv`: List of the pre-processed AudioSet files, as in the original M2D.
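Pre-training mixes the main speech samples with this background noise at the ratio set by `--noise_ratio` (see the options below). As a rough illustration only, here is a hedged sketch of such mixing in the log-mel domain; `mix_bg_noise` is a hypothetical helper, and the actual logic in `speech/train_speech.py` may differ (e.g., it may mix in a different domain).

```python
import torch

def mix_bg_noise(speech_lms: torch.Tensor, noise_lms: torch.Tensor,
                 noise_ratio: float = 0.2) -> torch.Tensor:
    """Hypothetical sketch: mix a log-mel speech sample with background noise.

    Converts both inputs back to the linear power domain, mixes them with the
    given ratio, and returns to the log domain.
    """
    mixed = (1.0 - noise_ratio) * speech_lms.exp() + noise_ratio * noise_lms.exp()
    return mixed.log()

# Example: two (80, 208) log-mel samples mixed at the default ratio of 0.2.
speech = torch.randn(80, 208)
noise = torch.randn(80, 208)
print(mix_bg_noise(speech, noise).shape)  # torch.Size([80, 208])
```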
The following command line will create `data/ls960_hybrid7s_hubaseL9` and `data/files_ls960_hybrid.csv`.

```sh
python speech/extract_offline_ls960.py /path/to/LibriSpeech
```
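For reference, the sketch below illustrates the kind of per-file processing involved, assuming torchaudio for audio I/O, nnAudio for the log-mel front-end, and Hugging Face Transformers for HuBERT. The spectrogram parameters, the file path, and the `facebook/hubert-base-ls960` model name are assumptions; see `speech/extract_offline_ls960.py` for the actual processing.

```python
import torch
import torchaudio
from nnAudio.features import MelSpectrogram
from transformers import HubertModel

# Placeholder path; the actual script walks the entire LS-960 tree.
wav, sr = torchaudio.load('/path/to/LibriSpeech/sample.flac')  # mono, 16 kHz

# Log-mel front-end; these parameters are assumptions (typical 16 kHz settings).
to_mel = MelSpectrogram(sr=16000, n_fft=400, hop_length=160, n_mels=80)
lms = (to_mel(wav) + torch.finfo(torch.float).eps).log()

# HuBERT-base features; hidden_states[9] holds the layer-#9 output.
hubert = HubertModel.from_pretrained('facebook/hubert-base-ls960').eval()
with torch.no_grad():
    feats = hubert(wav, output_hidden_states=True).hidden_states[9]

print(lms.shape, feats.shape)  # e.g., (1, 80, T) and (1, T', 768)
```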
For preparing the AudioSet data files (`data/audioset_lms` and `data/files_audioset.csv`), please follow `data/README.md`.
The `speech/train_speech.py` script pre-trains a model for speech.
The following example runs on any affordable GPU, consuming only 7,170 MiB of GPU memory. However, please note that it takes very long (over 20 minutes per epoch in our runs). Note that `--batch_size 128` with `--accum_iter 16` keeps the effective batch size at 128 × 16 = 2048.
You can also change the BG noise dataset by adding, for example, `--csv_bg_noise data/files_fsd50k.csv`.

```sh
python -m speech.train_speech --loss_m2d 1. --loss_off 1. --input_size 80x208 --patch_size 80x4 --noise_ratio 0.2 --batch_size 128 --accum_iter 16
```
The following commands pre-train the high-end models, taking 2.5 to 3.5 days to complete on four A100 GPUs (each run also has an effective batch size of 2048, e.g., 4 × 256 × 2).

```sh
OMP_NUM_THREADS=1 torchrun --nproc_per_node=4 -m speech.train_speech --loss_m2d 1. --loss_off .5 --input_size 80x400 --patch_size 80x2 --noise_ratio 0.2
OMP_NUM_THREADS=1 torchrun --nproc_per_node=4 -m speech.train_speech --loss_m2d 1. --loss_off .5 --input_size 80x512 --patch_size 80x2 --noise_ratio 0.2 --batch_size 256 --accum_iter 2
OMP_NUM_THREADS=1 torchrun --nproc_per_node=4 -m speech.train_speech --loss_m2d 1. --loss_off .5 --input_size 80x608 --patch_size 80x2 --noise_ratio 0.2 --batch_size 256 --accum_iter 2
```
- `--batch_size`: Batch size per GPU, 512 by default.
- `--epochs`: Training epochs, 1000 by default.
- `--accum_iter`: Iterations to accumulate gradients, 1 by default.
- `--input_size`: Input spectrogram size, 80x208 by default.
- `--patch_size`: Patch size, 80x4 by default.
- `--mask_ratio`: Masking ratio, 0.6 by default.
- `--loss_m2d`: Loss ratio for the M2D masked prediction, 1.0 by default.
- `--loss_off`: Loss ratio for the offline target, 0.0 by default.
- `--blr`: Base learning rate: `absolute_lr = base_lr * total_batch_size / 256` (see the worked example after this list).
- `--csv_main`: A CSV file listing the sample files in the main dataset, `data/files_ls960_hybrid.csv` by default.
- `--csv_bg_noise`: A CSV file listing the sample files in the BG noise dataset, `data/files_audioset.csv` by default.
- `--noise_ratio`: Noise mixing ratio, 0.2 by default.
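As a worked example of the learning-rate scaling above (the `--blr` value here is hypothetical): with `--blr 3e-4` on 4 GPUs with `--batch_size 256` and `--accum_iter 2`, the total batch size is 4 × 256 × 2 = 2048, so `absolute_lr = 3e-4 * 2048 / 256 = 2.4e-3`.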
We provide an upstream wrapper implementation, which you can import into your SUPERB environment as follows (see the example commands after these steps):
- Copy the `superb/upstream/m2d` folder under your `s3prl/upstream` folder.
- Make a symbolic link to your copy of the M2D repository under your `s3prl/upstream/m2d`, making `s3prl/upstream/m2d/m2d`. The wrapper files find the M2D programs under this symbolic link.
- Run `pip install -e .` under your `s3prl` folder to install your local SUPERB in your Python environment.
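For example, the steps could look like the following; all paths are placeholders for your own checkouts.

```sh
# All paths below are placeholders -- adjust them to your own checkouts.
cp -r superb/upstream/m2d /your/s3prl/upstream/   # step 1: copy the wrapper
ln -s /your/m2d /your/s3prl/upstream/m2d/m2d      # step 2: symlink this repository
cd /your/s3prl && pip install -e .                # step 3: install local SUPERB
```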
Please refer to `superb/upstream/m2d/README.md` for more details.
- Our code is based on the MAE PyTorch/GPU re-implementation of the paper Masked Autoencoders Are Scalable Vision Learners.
- We use nnAudio (KinWaiCheuk/nnAudio) for converting raw audio into log-mel spectrograms.
- We use Hugging Face Transformers for the implementation and pre-trained weights of the HuBERT model.
We appreciate these publicly available resources.
If you find our M2D-S useful in your research, please consider citing our paper:
```bibtex
@article{niizumi2023m2d4speech,
    title   = {{Masked Modeling Duo for Speech: Specializing General-Purpose Audio Representation to Speech using Denoising Distillation}},
    author  = {Daisuke Niizumi and Daiki Takeuchi and Yasunori Ohishi and Noboru Harada and Kunio Kashino},
    journal = {to appear at Interspeech},
    year    = {2023},
    url     = {https://arxiv.org/abs/2305.14079}
}
```
- SUPERB: Shu-wen Yang, Po-Han Chi, Yung-Sung Chuang, Cheng-I Jeff Lai, Kushal Lakhotia, Yist Y. Lin, Andy T. Liu, Jiatong Shi, Xuankai Chang, Guan-Ting Lin, Tzu-Hsien Huang, Wei-Cheng Tseng, Ko-tik Lee, Da-Rong Liu, Zili Huang, Shuyan Dong, Shang-Wen Li, Shinji Watanabe, Abdelrahman Mohamed, and Hung-yi Lee, "SUPERB: Speech Processing Universal PERformance Benchmark," Interspeech, 2021.
- HuBERT: W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units,” IEEE/ACM Trans. Audio, Speech, Language Process., pp. 3451–3460, 2021.