This repository implements Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis (SV2TTS) for the Persian language. The codebase is derived from the Real-Time-Voice-Cloning repository and has been updated to replace deprecated features and to support Persian text and data.
1. Character-set definition:

Open the `synthesizer/persian_utils/symbols.py` file and update the `_characters` variable so that it includes every character that appears in your text files. Most Persian characters and symbols are already included:

```python
_characters = "ءابتثجحخدذرزسشصضطظعغفقلمنهويِپچژکگیآۀأؤإئًَُّ!(),-.:;? ̠،…؛؟٪#ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz_–@+/\u200c"
```
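Any transcript character missing from `_characters` cannot be represented by the model, so it is worth checking for such characters up front. Below is a minimal sketch of such a check; the script name and dataset path are illustrative (matching the layout in step 2), and it assumes you run it from the repository root:

```python
# check_charset.py -- illustrative helper, not part of the repository.
# Lists transcript characters that are missing from _characters.
from pathlib import Path

from synthesizer.persian_utils.symbols import _characters

missing = {}
for txt_path in Path("dataset/persian_data").rglob("*.txt"):
    for ch in txt_path.read_text(encoding="utf-8"):
        if ch not in _characters and ch not in "\r\n":
            missing.setdefault(ch, txt_path)

for ch, first_seen in sorted(missing.items()):
    print(f"U+{ord(ch):04X} {ch!r} first seen in {first_seen}")
```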
2. Data structure:

```
dataset/persian_data/
    train_data/
        speaker1/book-1/
            sample1.txt
            sample1.wav
            ...
        ...
    test_data/
        ...
```
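Before preprocessing, it can help to confirm that every audio file has a matching transcript. A small sketch under the layout above (the script name is illustrative; paths are taken from this README):

```python
# verify_layout.py -- illustrative check, not part of the repository.
# Flags .wav files without a .txt transcript, and vice versa.
from pathlib import Path

for subset in ("train_data", "test_data"):
    root = Path("dataset/persian_data") / subset
    for wav in root.glob("*/*/*.wav"):  # speaker/book/sample.wav
        if not wav.with_suffix(".txt").exists():
            print(f"missing transcript: {wav}")
    for txt in root.glob("*/*/*.txt"):
        if not txt.with_suffix(".wav").exists():
            print(f"missing audio: {txt}")
```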
3. Preprocessing:

- Audio preprocessing:

```bash
python3 synthesizer_preprocess_audio.py dataset --datasets_name persian_data --subfolders train_data --no_alignments --skip_existing --n_processes 4 --out_dir dataset/train/SV2TTS/synthesizer
python3 synthesizer_preprocess_audio.py dataset --datasets_name persian_data --subfolders test_data --no_alignments --skip_existing --n_processes 4 --out_dir dataset/test/SV2TTS/synthesizer
```
- Embedding preprocessing:

```bash
python3 synthesizer_preprocess_embeds.py dataset/train/SV2TTS/synthesizer
python3 synthesizer_preprocess_embeds.py dataset/test/SV2TTS/synthesizer
```
4. Train synthesizer:

```bash
python3 synthesizer_train.py my_run dataset/train/SV2TTS/synthesizer
```
To generate a wav file, place all trained models in the `saved_models/final_models` directory. If you have not trained the speaker encoder or vocoder models yourself, you can use the pretrained ones from `saved_models/default`. The directory should contain `encoder.pt`, your latest synthesizer checkpoint (e.g. `synthesizer_000300.pt`), and a vocoder. Then run:
```bash
python3 inference.py --vocoder "WavRNN" --text "یک نمونه از خروجی" --ref_wav_path "/path/to/sample/reference.wav" --test_name "test1"
```
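For reference, a typical `saved_models/final_models` layout might look like this (the synthesizer checkpoint name depends on your training step; the vocoder directory matches the HiFiGAN download path shown below):

```
saved_models/final_models/
    encoder.pt
    synthesizer_000300.pt
    vocoder_HiFiGAN/
        ...
```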
WavRNN is an older vocoder; to use HiFiGAN instead, you must first download a pretrained (English) model:
- Install Parallel WaveGAN:

```bash
pip install parallel_wavegan
```
- Download a pretrained HiFiGAN model:

```python
from parallel_wavegan.utils import download_pretrained_model

download_pretrained_model("vctk_hifigan.v1", "saved_models/final_models/vocoder_HiFiGAN")
```
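To sanity-check the downloaded vocoder outside of `inference.py`, it can be loaded directly with `parallel_wavegan`. The sketch below is an assumption-laden illustration: the exact checkpoint filename inside the download directory may vary, `sample_mel.npy` is a hypothetical input, and the mel spectrogram must match the feature configuration the vocoder was trained with:

```python
# Illustrative sketch: vocode a mel spectrogram with the downloaded HiFiGAN.
import glob

import numpy as np
import torch
from parallel_wavegan.utils import load_model

# The download step above leaves a checkpoint .pkl next to its config.yml
# (possibly in a tag-named subdirectory, hence the recursive glob).
ckpt = sorted(glob.glob("saved_models/final_models/vocoder_HiFiGAN/**/*.pkl",
                        recursive=True))[-1]
model = load_model(ckpt)  # picks up the config stored beside the checkpoint
model.remove_weight_norm()
model.eval()

# Hypothetical input: a (frames, n_mels) mel spectrogram saved as .npy.
mel = torch.from_numpy(np.load("sample_mel.npy")).float()
with torch.no_grad():
    wav = model.inference(mel)  # waveform tensor of shape (samples, 1)
```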
- Run inference with HiFiGAN:

```bash
python3 inference.py --vocoder "HiFiGAN" --text "یک نمونه از خروجی" --ref_wav_path "/path/to/sample/reference.wav" --test_name "test1"
```
This architecture has been used to train a Persian Text-to-Speech (TTS) model on the ManaTTS dataset, the largest publicly available single-speaker Persian corpus. The trained model weights and detailed inference instructions can be found in the following references:
- Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis, Ye Jia et al.
- Real-Time-Voice-Cloning repository
- ParallelWaveGAN repository
- Persian-MultiSpeaker-Tacotron2 repository
This project is based on Real-Time-Voice-Cloning, which is licensed under the MIT License.
Modified & original work Copyright (c) 2019 Corentin Jemine (https://github.com/CorentinJ)
Original work Copyright (c) 2018 Rayhane Mama (https://github.com/Rayhane-mamah)
Original work Copyright (c) 2019 fatchord (https://github.com/fatchord)
Original work Copyright (c) 2015 braindead (https://github.com/braindead)
Modified work Copyright (c) 2025 Majid Adibian (https://github.com/Adibian)
Modified work Copyright (c) 2025 Mahta Fetrat (https://github.com/MahtaFetrat)