Official PyTorch implementation for the following paper:
Textless Unit-to-Unit training for Many-to-Many Multilingual Speech-to-Speech Translation
Minsu Kim*, Jeongsoo Choi*, Dahun Kim, Yong Man Ro
IEEE/ACM Transactions on Audio, Speech, and Language Processing
[Paper] [Demo]
Setup

Python >=3.7,<3.11
git clone -b main --single-branch https://github.com/choijeongsoo/utut
cd utut
git submodule init
git submodule update
pip install -e fairseq
pip install -r requirements.txt
apt-get install espeak
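After installation, a quick import check can confirm the environment is usable. This is only a sanity-check sketch; it assumes nothing beyond the packages installed above.

```python
# Quick sanity check for the environment set up above (a minimal sketch).
import torch
import fairseq

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("fairseq:", fairseq.__version__)
```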
Speech to Unit Quantization

- mHuBERT Base, layer 11, km 1000
- reference: textless_s2st_real_data
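The quantizer turns 16 kHz speech into discrete units by running mHuBERT, taking its layer-11 features, and assigning each frame to one of 1000 k-means clusters. The inference scripts in this repository do this internally; the sketch below only illustrates the idea, assuming the k-means .bin is a joblib-serialized scikit-learn model as in fairseq's textless_s2st_real_data recipe (all paths are placeholders).

```python
# Illustrative sketch of speech-to-unit quantization (mHuBERT layer 11, km 1000).
# Assumption: the .bin checkpoint is a joblib-serialized scikit-learn k-means model.
import joblib
import torch
import torchaudio
from fairseq import checkpoint_utils

wav, sr = torchaudio.load("samples/en/1.wav")         # (1, T), expected 16 kHz mono
models, _, _ = checkpoint_utils.load_model_ensemble_and_task(
    ["/path/to/mhubert_base_vp_en_es_fr_it3.pt"]
)
mhubert = models[0].eval()

with torch.no_grad():
    feats, _ = mhubert.extract_features(wav, padding_mask=None, mask=False, output_layer=11)

km = joblib.load("/path/to/mhubert_base_vp_en_es_fr_it3_L11_km1000.bin")
units = km.predict(feats.squeeze(0).numpy())           # one unit id (0-999) per frame
print(units[:20])
```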
Pre-trained Model

| Task | Pretraining Data | Model |
|------|------------------|-------|
| STS  | VoxPopuli (from year 2013), mTEDx | download |
| TTS  | VoxPopuli (from year 2013), mTEDx | download |
| TTST | VoxPopuli (from year 2013), mTEDx | download |
Unit to Speech Synthesis (Vocoder)

- En (English), Es (Spanish), and Fr (French)
  - reference: textless_s2st_real_data
- It (Italian), De (German), and Nl (Dutch)

| Unit config | Unit size | Vocoder language | Dataset | Model |
|-------------|-----------|------------------|---------|-------|
| mHuBERT, layer 11 | 1000 | It | M-AILABS (male) | ckpt, config |
| mHuBERT, layer 11 | 1000 | De | CSS10 | ckpt, config |
| mHuBERT, layer 11 | 1000 | Nl | CSS10 | ckpt, config |
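Each vocoder maps a sequence of discrete units back to a waveform. Below is a minimal sketch of how such a unit-based HiFi-GAN checkpoint is typically used, assuming the ckpt/config pairs above load with fairseq's CodeHiFiGANVocoder wrapper as in the speech-resynthesis recipes (the paths and the toy unit sequence are placeholders).

```python
# Illustrative sketch of unit-to-waveform synthesis with a unit-based HiFi-GAN vocoder.
# Assumption: the checkpoints above are compatible with fairseq's CodeHiFiGANVocoder.
import json
import torch
from fairseq.models.text_to_speech.vocoder import CodeHiFiGANVocoder

with open("/path/to/config_es.json") as f:
    vocoder_cfg = json.load(f)
vocoder = CodeHiFiGANVocoder("/path/to/vocoder_es.pt", vocoder_cfg).eval()

units = torch.LongTensor([52, 52, 941, 941, 13, 7, 7, 320]).view(1, -1)  # toy unit ids
with torch.no_grad():
    wav = vocoder({"code": units}, dur_prediction=True)  # 1-D waveform tensor at 16 kHz
print(wav.shape)
```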
UTUT is pre-trained on VoxPopuli and mTEDx, where a large portion of the data comes from European Parliament events.
Before using the pre-trained models, please consider whether the data domain you want to apply them to matches this training domain.
Speech to Speech Translation (STS)

$ cd utut
$ PYTHONPATH=fairseq python inference_sts.py \
--in-wav-path samples/en/1.wav samples/en/2.wav samples/en/3.wav \
--out-wav-path samples/es/1.wav samples/es/2.wav samples/es/3.wav \
--src-lang en --tgt-lang es \
--mhubert-path /path/to/mhubert_base_vp_en_es_fr_it3.pt \
--kmeans-path /path/to/mhubert_base_vp_en_es_fr_it3_L11_km1000.bin \
--utut-path /path/to/utut_sts.pt \
--vocoder-path /path/to/vocoder_es.pt \
--vocoder-cfg-path /path/to/config_es.json
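Since --in-wav-path and --out-wav-path accept multiple files, translating a whole folder can be scripted around the command above. A hedged sketch (checkpoint locations are placeholders, exactly as in the command above):

```python
# Sketch: run en->es STS over every wav in a folder using the command shown above.
import os
import subprocess
from pathlib import Path

in_dir, out_dir = Path("samples/en"), Path("samples/es")
out_dir.mkdir(parents=True, exist_ok=True)
in_wavs = sorted(str(p) for p in in_dir.glob("*.wav"))
out_wavs = [str(out_dir / Path(p).name) for p in in_wavs]

subprocess.run(
    ["python", "inference_sts.py",
     "--in-wav-path", *in_wavs,
     "--out-wav-path", *out_wavs,
     "--src-lang", "en", "--tgt-lang", "es",
     "--mhubert-path", "/path/to/mhubert_base_vp_en_es_fr_it3.pt",
     "--kmeans-path", "/path/to/mhubert_base_vp_en_es_fr_it3_L11_km1000.bin",
     "--utut-path", "/path/to/utut_sts.pt",
     "--vocoder-path", "/path/to/vocoder_es.pt",
     "--vocoder-cfg-path", "/path/to/config_es.json"],
    env={**os.environ, "PYTHONPATH": "fairseq"},
    check=True,
)
```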
Text to Speech Synthesis (TTS)

$ cd utut
$ PYTHONPATH=fairseq python inference_tts.py \
--in-txt-path samples/en/a.txt samples/en/b.txt samples/en/c.txt \
--out-wav-path samples/en/a.wav samples/en/b.wav samples/en/c.wav \
--src-lang en --tgt-lang en \
--utut-path /path/to/utut_tts.pt \
--vocoder-path /path/to/vocoder_en.pt \
--vocoder-cfg-path /path/to/config_en.json
Text to Speech Translation (TTST)

$ cd utut
$ PYTHONPATH=fairseq python inference_ttst.py \
--in-txt-path samples/en/a.txt samples/en/b.txt samples/en/c.txt \
--out-wav-path samples/es/a.wav samples/es/b.wav samples/es/c.wav \
--src-lang en --tgt-lang es \
--utut-path /path/to/utut_ttst.pt \
--vocoder-path /path/to/vocoder_es.pt \
--vocoder-cfg-path /path/to/config_es.json
19 source languages: en (English), es (Spanish), fr (French), it (Italian), pt (Portuguese), el (Greek), ru (Russian), cs (Czech), da (Danish), de (German), fi (Finnish), hr (Croatian), hu (Hungarian), lt (Lithuanian), nl (Dutch), pl (Polish), ro (Romanian), sk (Slovak), and sl (Slovene)
6 target languages: en (English), es (Spanish), fr (French), it (Italian), de (German), and nl (Dutch)
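When scripting over many language pairs, it can help to validate the pair against the supported sets before launching a job. A small sketch that simply mirrors the lists above:

```python
# Sketch: reject unsupported language pairs before running inference.
SRC_LANGS = {"en", "es", "fr", "it", "pt", "el", "ru", "cs", "da", "de",
             "fi", "hr", "hu", "lt", "nl", "pl", "ro", "sk", "sl"}
TGT_LANGS = {"en", "es", "fr", "it", "de", "nl"}

def check_pair(src: str, tgt: str) -> None:
    if src not in SRC_LANGS:
        raise ValueError(f"unsupported source language: {src}")
    if tgt not in TGT_LANGS:
        raise ValueError(f"unsupported target language: {tgt}")

check_pair("en", "es")  # ok
```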
This repository is built upon Fairseq and speech-resynthesis. We thank the authors of these projects for open-sourcing their code.
If our work is useful for your research, please cite the following papers:
@article{kim2024textless,
title={Textless Unit-to-Unit training for Many-to-Many Multilingual Speech-to-Speech Translation},
author={Kim, Minsu and Choi, Jeongsoo and Kim, Dahun and Ro, Yong Man},
journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
year={2024}
}
@inproceedings{choi2024av2av,
title={AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation},
author={Choi, Jeongsoo and Park, Se Jin and Kim, Minsu and Ro, Yong Man},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2024}
}