Training from scratch? #12
Hello @SoshyHayami, as I understand it, you will need the hifigan and vqvae checkpoints, but I don't think you necessarily need to provide the checkpoint path of an already trained model. Are you sure your paths to the hifigan and vqvae checkpoints are correct? Also, if you are loading a different hifigan checkpoint, could you show me which one? You might need to modify the load_checkpoint function (try printing the keys of the checkpoint dictionary and checking whether 'iteration' exists).
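For reference, a minimal sketch of how to inspect what a checkpoint contains (the path below is a placeholder):

```python
import torch

# Placeholder path; point this at your actual hifigan checkpoint.
ckpt = torch.load("path/to/your_hifigan.pth", map_location="cpu")
print(list(ckpt.keys()))      # training-style checkpoints usually contain several keys
print("iteration" in ckpt)    # the repo's load_checkpoint appears to look for this key
```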
Yeah, upon further investigation I realized it has something to do with what you said. Should I simply manipulate the keys inside the hifigan checkpoint? I would rather not do that, since it would become impossible for me to debug given how many things I've already touched to configure it for 24 kHz. I hope the author releases their 24 kHz config and vocoder if they have one.
I looked at the keys: the author's vocoder has these:
while mine has just a single one.
Hello, in the train code, loading the vocoder is simply for checking the audio through validation. If you trained this model with the same Mel settings as your desired 24 kHz vocoder, you can use it for validation and inference. It seems that parts such as the optimizer and iteration have been excluded, but you can load the pre-trained vocoder in the following way, so please try it:

```python
from vocoder.models import Generator

model = Generator(hps).cuda()
state_dict = load_checkpoint("your vocoder path", device="cuda")
model.load_state_dict(state_dict['generator'])
model.eval()
model.remove_weight_norm()
```
Sure, thanks. I'll try that.
After days of trying, and after redoing the pre-processing a few times (even at 16 kHz), I always end up with the same error. Well guys, time to give up.
Hello, @SoshyHayami. That is a NumPy error, and it seems to be related to an invalid range passed to `randint`. It appears the error occurs at this part of the code (lines 50 to 52 in 9c1e57c):
In this code, the value divided by 80 is used as the starting value for F0, because we use F0 at a resolution 4 times higher than that of the Mel scale. This adjustment ensures that the segment sizes match. For the Mel scale, we use a hop size of 320. Please adjust this part of the code accordingly. Thanks.
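For illustration only, here is a rough sketch of the relationship described above; this is not the repo's exact code, and the segment length and clip length are made-up values:

```python
import random

hop_mel = 320      # Mel hop size used in this repo
hop_f0 = 80        # F0 hop size; 4x the Mel resolution, as described above
seg_frames = 35    # hypothetical Mel segment length in frames

audio_len = 160000                            # example clip: 10 s at 16 kHz
max_start = audio_len - seg_frames * hop_mel
if max_start <= 0:
    # roughly the situation that produces the invalid-range randint error:
    # the clip is shorter than one training segment
    raise ValueError("clip shorter than one training segment")

start = random.randint(0, max_start)   # random offset in audio samples
mel_start = start // hop_mel           # corresponding start index in Mel frames
f0_start = start // hop_f0             # start index in F0 frames (about mel_start * 4)
```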
Hi, I understand it's because of the invalid range given to randint. I used the exact same pre-processing steps on a 16 kHz audio dataset, aggressively cutting the silent parts, but I still get the error. By the way, are the numbers 80 and 1280 related to the number of mel bins and the window size, respectively? The vocoder I was going to use for the 24 kHz version uses this config:
So I assumed all instances of 1280, whether in the utils or the extraction scripts, must be changed to 1200. (I should re-emphasize that the main issue I described above regarding the F0 calculations happens even with the unmodified code on a 16 kHz dataset, using the checkpoints you provided, so what I'm saying here is unrelated to that.)
@SoshyHayami, for what it's worth, I trained the model at 16 kHz on 500-700 thousand wav files using Crepe for the pitch embeddings and had no issues (with no attention paid to silence, and a very diverse dataset). I'm currently trying to do a 24 kHz model, but I'm still fighting to get everything aligned (I'm trying to do it without extracting new Crepe pitch embeddings, as that would take me almost 2 days even with two 3090s on it). What did you do with wav2vec? Did you just keep the same model and give it 16 kHz input?

@hayeong0 Thanks for sharing this awesome model, by the way! I've been so disappointed with all of the publicly available voice conversion diffusion models. They pretty much all use the same core strategy and code, which invariably means they train on source voice to source voice, and regardless of the other innovations and tricks, the result is always poor. Your Mixin strategy is brilliant! I don't fully understand why it works so well, or why it seems to take far more positive style transfer than negative content transfer from it, but it does work very well! I have not had a chance to experiment much with my results yet, but from a couple of experiments, zero-shot with over 4,000 speakers looked pretty darn impressive! It also nailed a couple of unique voices in the training data that always elude these other models.
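One hypothetical way to handle the wav2vec question when moving to 24 kHz is to keep the pretrained 16 kHz content encoder and simply resample before it; the file name below is a placeholder:

```python
import torchaudio

# wav2vec 2.0 / XLS-R models expect 16 kHz input, so 24 kHz training audio
# would be resampled before the content encoder. Placeholder file name.
wav_24k, sr = torchaudio.load("example_24khz.wav")                  # sr == 24000
resampler = torchaudio.transforms.Resample(orig_freq=sr, new_freq=16000)
wav_16k = resampler(wav_24k)                                        # feed this to wav2vec
```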
I finally finished my 24 kHz version of this model. I ended up finetuning a BigVGAN-V2 model for the vocoder. I also trained a VQGAN to map the mel-spectrograms to a latent space for the diffusion part (for performance, and to fit in all the additional attention), and I replaced the diffusion model's UNet with one from the Diffusers library, changing the loss calculation and reverse diffusion to work with it (so you can use any Diffusers scheduler). I configured the UNets essentially as they are for Stable Diffusion SDXL, except that I used three downsample layers instead of four and not quite as many attention heads. For the hell of it, I also added a bunch of multi-paired conversions generated by a commercial voice conversion model to the data (mixing is disabled when training hits them). Then I trained on about 500,000 wav files. The results are pretty darn good.
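As a rough sketch of the kind of UNet swap described above (the channel counts, attention dimensions, and latent shape here are illustrative guesses, not the commenter's actual configuration):

```python
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel(
    sample_size=64,             # VQGAN latent resolution, assumed
    in_channels=4,              # VQGAN latent channels, assumed
    out_channels=4,
    down_block_types=(          # three downsampling stages, as mentioned above
        "DownBlock2D",
        "CrossAttnDownBlock2D",
        "CrossAttnDownBlock2D",
    ),
    up_block_types=(
        "CrossAttnUpBlock2D",
        "CrossAttnUpBlock2D",
        "UpBlock2D",
    ),
    block_out_channels=(320, 640, 1280),
    cross_attention_dim=768,    # width of the content/style conditioning, assumed
    attention_head_dim=8,       # fewer attention heads than stock SDXL, assumed
)
```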
Have you tried using the author's pre-trained model to finetune on a single person's voice data? How were the results? I used single-person audio from the web for finetuning, but the results were poor, with problems such as electronic-sounding artifacts.
No, I never tried to finetune the original code. I did train the original from scratch on a custom dataset, and it came out very well, though.
Although, I should say, it required very clean audio files for good results. That bothered me, so I changed it to train 20% of the time on audio that has been deteriorated and/or overlapped with background noise. That has made it much less finicky about the input audio.
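A sketch of that kind of 20%-of-the-time corruption might look like the following; the specific degradations and their parameters are assumptions, not the commenter's actual pipeline:

```python
import random
from typing import Optional

import numpy as np

def degrade(wav: np.ndarray, noise: Optional[np.ndarray] = None, p: float = 0.2) -> np.ndarray:
    """Randomly deteriorate a mono waveform (range -1..1) with probability p."""
    if random.random() >= p:
        return wav
    out = wav.copy()
    # a few short dropouts (zeroed spans)
    for _ in range(random.randint(1, 3)):
        start = random.randint(0, max(0, len(out) - 400))
        out[start:start + random.randint(40, 400)] = 0.0
    # mild clipping distortion
    gain = random.uniform(1.5, 3.0)
    out = np.clip(out * gain, -1.0, 1.0) / gain
    # optional low-level background overlap
    if noise is not None:
        n = min(len(out), len(noise))
        out[:n] += 0.1 * noise[:n]
    return np.clip(out, -1.0, 1.0)
```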
Thank you for sharing your experience! 😆🔥 When the target audio for conversion contains background noise or is noisy, the synthesized (converted) audio may also include noise. This happens because the style encoder extracts the style using average pooling on Mel-spectrograms, which doesn’t account for noise separately. As you mentioned, using clean audio for training and inference leads to better results. Approaches that use noisy data as training data are also a nice idea! To address the noise in the current model structure, I found that increasing the encoder size and extending the training segment length were helpful. I’ve conducted follow-up research related to VC, improving audio quality and speaker similarity. I plan to upload the model by December!
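To make the average-pooling point concrete, here is a toy sketch; the shapes and layer sizes are assumptions, not the actual style encoder:

```python
import torch

mel = torch.randn(1, 80, 400)                                   # (batch, mel_bins, frames)
encoder = torch.nn.Conv1d(80, 256, kernel_size=5, padding=2)    # stand-in for the encoder layers
style = encoder(mel).mean(dim=-1)                               # average pooling over time -> (1, 256)
# any noise present in the reference Mel frames is averaged into this style vector
```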
Thanks! I'll definitely keep my eye out for anything regarding better speaker similarity! I wasn't having so much of a problem with noise; it was more degradation. Windows 11 (or its drivers) is just terrible when it comes to getting clean audio from an audio interface. I have two Focusrite interfaces and they are both almost always problematic. So you get these little micro-dropouts or distortions, and to the human ear they are not that bad; you can tell the audio is not crystal clear, but for a quick home recording you wouldn't think much of it. But those spots really trip up the voice conversion, as if the issue were multiplied dramatically: it doesn't fail by sounding similar to how it should but a bit off, it fails hard. It's the same for others, like Sovits and RVC. So I just tried to simulate that kind of damage 20% of the time, and that largely solved it. I tossed in other stuff as well, like RIR room noise and background sounds, just for fun, but I don't know that that was very useful; I only compared against a bunch of audio I was having issues with regarding that specific unclean audio. The model using VQ-GAN latents is even more robust against it.
@hayeong0 Is this algorithm not suitable for fine-tuning on a small amount of single-person data?
It probably doesn't work great for that. The training mixes voices: it pairs the speaker embedding with content and pitch from other speakers. If you finetune on one voice, you don't get any mixing with other embeddings. It should work better than nothing, but it's certainly not ideal.
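A rough sketch of the mixing idea (not the repo's exact code): within a batch, each utterance's content and pitch get paired with a speaker embedding shuffled from another utterance, which is exactly what a single-speaker finetuning set cannot provide.

```python
import torch

def shuffle_speaker_embeddings(spk_emb: torch.Tensor) -> torch.Tensor:
    """Pair each item in the batch with another item's speaker embedding."""
    perm = torch.randperm(spk_emb.size(0))
    return spk_emb[perm]
```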
Hi, I was facing a similar issue, but then I checked the filelists train_wav.txt and train_f0.txt: the number of file paths (and their order) did not match between the two. There is no logic in the code to match filenames, so you have to ensure this yourself when creating the filelists. I was able to run the training code after doing so.
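A quick sanity check along those lines (the filelist paths are assumptions; adjust to wherever your lists live):

```python
# Verify that train_wav.txt and train_f0.txt have the same length and ordering.
with open("filelists/train_wav.txt") as fw, open("filelists/train_f0.txt") as ff:
    wavs = [line.strip() for line in fw if line.strip()]
    f0s = [line.strip() for line in ff if line.strip()]

assert len(wavs) == len(f0s), f"count mismatch: {len(wavs)} wav paths vs {len(f0s)} F0 paths"
for w, f in zip(wavs, f0s):
    stem_w = w.rsplit("/", 1)[-1].split(".")[0]
    stem_f = f.rsplit("/", 1)[-1].split(".")[0]
    assert stem_w in stem_f or stem_f in stem_w, f"possible order mismatch: {w} vs {f}"
```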
I'm getting this error; it sounds like your code throws an error if I don't feed it your base model, is that right?
The thing is, I don't want to fine-tune your base model. I need to train it from scratch.
There are also a couple of mistakes in your code. For example,
`from hifigan.vocoder import HiFi`
is an invalid path and should be `from vocoder.hifigan import HiFi`,
plus the path to your hifigan, dependencies, etc. So far, I've not been able to start a training session, but I hope it works with the regular HiFi-GAN checkpoints provided by its authors.