Multilingual Speech Synthesis System Using VITS
- A Windows/Linux system with a minimum of 16GB RAM.
- A GPU with at least 12GB of VRAM.
- Python == 3.8
- Anaconda installed.
- PyTorch installed.
- CUDA 11.x installed.
- Zlib DLL installed.
PyTorch install command:
pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu117
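After installing, you can optionally confirm that the CUDA build of PyTorch is active with a short check (a sanity-check sketch, not part of this repository):

```python
import torch

# Verify the CUDA 11.7 build of PyTorch is installed and a GPU is visible.
print(torch.__version__)              # expected: 1.13.1+cu117
print(torch.cuda.is_available())      # expected: True on a working setup
print(torch.cuda.get_device_name(0))  # prints your GPU model
```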
CUDA 11.7 install:
https://developer.nvidia.com/cuda-11-7-0-download-archive
Zlib DLL install:
https://docs.nvidia.com/deeplearning/cudnn/install-guide/index.html#install-zlib-windows
Install pyopenjtalk manually:
pip install -U pyopenjtalk --no-build-isolation
If this command fails, install the following libraries before proceeding:
- cmake
- Cython
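Both can be installed with pip (assuming pip points at the polylangvits environment):
pip install cmake Cython
Once pyopenjtalk is installed, a one-line grapheme-to-phoneme call is a quick way to confirm it works:
python -c "import pyopenjtalk; print(pyopenjtalk.g2p('こんにちは'))"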
- Create an Anaconda environment:
conda create -n polylangvits python=3.8
- Activate the environment:
conda activate polylangvits
- Clone this repository to your local machine:
git clone https://github.com/ORI-Muchim/PolyLangVITS.git
- Navigate to the cloned directory:
cd PolyLangVITS
- Install the necessary dependencies:
pip install -r requirements.txt
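As a quick sanity check that the key dependencies are importable (the package names here are taken from the steps above, not from the repository's documentation):
python -c "import torch, pyopenjtalk; print('dependencies OK')"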
Place the audio files as shown below; both .mp3 and .wav files are accepted.
You must append '[language code]' to the end of each speaker folder name.
PolyLangVITS
├── datasets
│   ├── speaker0[KO]
│   │   ├── 1.mp3
│   │   └── 1.wav
│   ├── speaker1[JA]
│   │   ├── 1.mp3
│   │   └── 1.wav
│   ├── speaker2[EN]
│   │   ├── 1.mp3
│   │   └── 1.wav
│   ├── speaker3[ZH]
│   │   ├── 1.mp3
│   │   └── 1.wav
│   ├── integral.py
│   └── integral_low.py
├── vits
├── get_pretrained_model.py
├── inference.py
├── main_low.py
├── main_resume.py
├── main.py
├── Readme.md
└── requirements.txt
This is just an example, and it's okay to add more speakers.
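As an illustration of the naming convention, here is a minimal sketch of how the '[language code]' suffix could be parsed from the speaker folders; the parse_speaker_dir helper is hypothetical and not part of this repository:

```python
import re
from pathlib import Path

# Hypothetical helper: split a folder name like "speaker0[KO]"
# into the speaker name and its language code.
def parse_speaker_dir(name: str):
    match = re.fullmatch(r"(.+)\[([A-Z]{2})\]", name)
    if match is None:
        raise ValueError(f"Folder {name!r} is missing a [language code] suffix")
    return match.group(1), match.group(2)

for folder in sorted(Path("datasets").iterdir()):
    if folder.is_dir():
        speaker, lang = parse_speaker_dir(folder.name)
        print(f"{speaker}: language={lang}")
```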
To start this tool, use the following command, replacing {language}, {model_name}, and {sample_rate} with your respective values:
python main.py {language} {model_name} {sample_rate}
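For example, to train a model named mymodel on Korean data at a 22050 Hz sample rate (the values are illustrative, and the language is assumed to be passed as a lowercase code such as ko):
python main.py ko mymodel 22050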
For those with low specifications (VRAM < 12GB), please use this command instead:
python main_low.py {language} {model_name} {sample_rate}
If the data configuration is complete and you want to resume training, run this command:
python main_resume.py {model_name}
After the model has been trained, you can generate predictions by using the following command, replacing {model_name} and {model_step} with your respective values:
python inference.py {model_name} {model_step}
For text-to-speech inference, use the following:
python inference-stt.py {model_name} {model_step}
You can also pass the text directly on the command line, without editing the code:
python inference-stt.py {model_name} {model_step} {text}
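For example (model name, step count, and text are illustrative; quote the text so the shell passes it as a single argument):
python inference-stt.py mymodel 50000 "안녕하세요"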
For more information, please refer to the following repositories: