
Voice Cloning with your personal data



People have different preferences for voices. In response to the community's needs, we are thrilled to release the voice cloning code with tutorials.

Precautions Before Starting:

  1. At least one Nvidia GPU is required for training and voice cloning.
  2. Data for the target voice is crucial for voice cloning. The detailed requirements are provided in the next section.
  3. Currently, only Chinese and English are supported: you can train with Chinese data, English data, or both, resulting in a model capable of speaking both languages.
  4. Although EmotiVoice supports emotional prompts, if you want your voice to convey emotions, your data should already contain emotional elements.
  5. After training solely with your data, the original voices from EmotiVoice will be altered. This means that the new model will be entirely customized based on your data. If you wish to use EmotiVoice's original 2000+ voices, it is recommended to use the pre-trained model instead.

Detailed requirements for training data

  1. Audio data should be of high quality: clear, undistorted speech from a single individual.
  2. Text corresponding to each audio should align with the content of the speech. Before training, the original text is converted into phonemes using G2P. It is important to pay special attention to short pauses (sp*) and polyphones, as they can have an impact on the quality of training.
  3. If you desire your voice to convey emotions, your data should already contain emotional elements. Additionally, the content of the tag 'prompt' should be appropriately modified for each audio. Prompts can include emotions, speed, and any form of text descriptions of the speaking style.
  4. After that, you should obtain a data directory containing two subdirectories, named train and valid. Each subdirectory has a datalist.jsonl file with one entry per line, in the following format: {"key": "LJ002-0020", "wav_path": "data/LJspeech/wavs/LJ002-0020.wav", "speaker": "LJ", "text": ["<sos/eos>", "[IH0]", "[N]", "engsp1", "[EY0]", "[T]", "[IY1]", "[N]", "engsp1", "[TH]", "[ER1]", "[T]", "[IY1]", "[N]", ".", "<sos/eos>"], "original_text": "In 1813", "prompt": "common"}. A minimal script for generating such files is sketched after this list.
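
For English data, a datalist.jsonl like the one above can be assembled with a short script. The following is a minimal sketch only: it uses the third-party g2p_en package for grapheme-to-phoneme conversion (not necessarily the G2P used by the recipes), covers English only, and only approximates the pause ("engsp1") and punctuation handling; for real training data, follow the scripts from the DataBaker and LJSpeech recipes.

    # Minimal sketch: build a datalist.jsonl from simple metadata (English only).
    # Assumes `pip install g2p_en`; the official recipes may tokenize differently.
    import json
    from pathlib import Path

    from g2p_en import G2p  # third-party English G2P, used here for illustration

    g2p = G2p()

    def text_to_tokens(text):
        """Roughly reproduce the bracketed phoneme sequence shown above."""
        tokens = ["<sos/eos>"]
        for ph in g2p(text):
            if ph == " ":
                tokens.append("engsp1")      # short pause between words (see recipes for exact rules)
            elif ph in ",.!?;:":
                tokens.append(ph)            # punctuation is kept as-is
            elif ph.strip():
                tokens.append(f"[{ph}]")     # phonemes are wrapped in brackets, e.g. [IH0]
        if tokens[-1] not in ",.!?":
            tokens.append(".")
        tokens.append("<sos/eos>")
        return tokens

    def write_datalist(entries, out_path):
        """entries: iterable of dicts with key, wav_path, speaker, original_text, prompt."""
        out_path = Path(out_path)
        out_path.parent.mkdir(parents=True, exist_ok=True)
        with out_path.open("w", encoding="utf-8") as f:
            for e in entries:
                record = {
                    "key": e["key"],
                    "wav_path": e["wav_path"],
                    "speaker": e["speaker"],
                    "text": text_to_tokens(e["original_text"]),
                    "original_text": e["original_text"],
                    "prompt": e.get("prompt", "common"),
                }
                f.write(json.dumps(record, ensure_ascii=False) + "\n")

    # Example with hypothetical paths:
    write_datalist(
        [{"key": "LJ002-0020", "wav_path": "data/LJspeech/wavs/LJ002-0020.wav",
          "speaker": "LJ", "original_text": "In 1813", "prompt": "common"}],
        "data/my_voice/train/datalist.jsonl",
    )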

Step-by-Step Training Process:

The best tutorial for Mandarin Chinese is our DataBaker Recipe, and for English it is the LJSpeech Recipe. Below is a summary:

  1. Prepare the training environment; this step is only necessary once.

    # create conda environment
    conda create -n EmotiVoice python=3.8 -y
    conda activate EmotiVoice
    # then run:
    pip install EmotiVoice[train]
    # or
    git clone https://github.com/netease-youdao/EmotiVoice
    pip install -e .[train]
  2. Prepare the data according to the Detailed requirements for training data section. Of course, you can use the provided methods and scripts from the DataBaker Recipe and LJSpeech Recipe.

  3. Next, run the following command to create a directory for training: python prepare_for_training.py --data_dir <data directory> --exp_dir <experiment directory>.

    Replace <data directory> with the actual path to your data directory and <experiment directory> with the desired path for your experiment directory.

  4. You can customize the training settings by modifying the parameters in <experiment directory>/config/config.py based on your server and data. Once you have made the necessary changes, start training by running: torchrun --nproc_per_node=1 --master_port 8018 train_am_vocoder_joint.py --config_folder <experiment directory>/config --load_pretrained_model True. This command starts training with the specified configuration folder and loads the pre-trained model because --load_pretrained_model is set to True.

  5. After several training epochs, select some checkpoints and run the following command to perform inference and verify that they meet your expectations: python inference_am_vocoder_exp.py --config_folder exp/DataBaker/config --checkpoint g_00010000 --test_file data/inference/text. Please remember to modify the speaker name in data/inference/text. If the results are satisfactory, you can use the new model as desired. We also provide a modified version of the demo page: demo_page_databaker.py.

  6. If the results are not up to par, you can either wait for further training epochs or review your data and environment; a small data-directory check is sketched after this list. Of course, you can also consult the community or create an issue for assistance.
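
If you suspect a data issue, the following is a minimal sketch (not part of EmotiVoice) that checks whether a data directory matches the layout described in the Detailed requirements for training data section: train and valid subdirectories, each with a parseable datalist.jsonl whose wav_path files exist.

    # Minimal data-directory sanity check for the layout described above:
    # <data_dir>/{train,valid}/datalist.jsonl with the documented fields.
    import json
    import sys
    from pathlib import Path

    REQUIRED_FIELDS = {"key", "wav_path", "speaker", "text", "original_text", "prompt"}

    def check_data_dir(data_dir):
        data_dir = Path(data_dir)
        ok = True
        for split in ("train", "valid"):
            datalist = data_dir / split / "datalist.jsonl"
            if not datalist.is_file():
                print(f"missing {datalist}")
                ok = False
                continue
            for lineno, line in enumerate(datalist.open(encoding="utf-8"), start=1):
                try:
                    record = json.loads(line)
                except json.JSONDecodeError as err:
                    print(f"{datalist}:{lineno}: invalid JSON ({err})")
                    ok = False
                    continue
                missing = REQUIRED_FIELDS - record.keys()
                if missing:
                    print(f"{datalist}:{lineno}: missing fields {sorted(missing)}")
                    ok = False
                elif not Path(record["wav_path"]).is_file():
                    print(f"{datalist}:{lineno}: wav file not found: {record['wav_path']}")
                    ok = False
        return ok

    if __name__ == "__main__":
        sys.exit(0 if check_data_dir(sys.argv[1]) else 1)

For example, save it under a name of your choice (say check_data.py) and run python check_data.py <data directory> before step 3.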

Reference Information for Running Time:

The following information regarding running time and hardware environment is provided for your reference:

  • Software versions: Python 3.8.18, torch 1.13.1, CUDA 11.7
  • GPU card type: NVIDIA GeForce RTX 3090, NVIDIA A40
  • Training time: Approximately 1 to 2 hours are required to train for 10,000 steps.

Training is even possible without an Nvidia GPU; just be patient, as it will take considerably longer.