Training from scratch? #12
Hello @SoshyHayami, as I understand it, you will need the hifigan and vqvae checkpoints, but I don't think you necessarily need to provide the checkpoint path of an already trained model. Are you sure your paths to the hifigan and vqvae checkpoints are correct? Also, if you are loading a different hifigan checkpoint, could you show me which one? You might need to modify the load_checkpoint function (try printing the keys of the checkpoint dictionary and checking whether 'iteration' exists).
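For reference, a minimal sketch of how to inspect what a checkpoint contains (the path below is a placeholder):

```python
import torch

# Placeholder path; point this at your actual hifigan checkpoint.
ckpt = torch.load("path/to/your_hifigan.pth", map_location="cpu")
print(list(ckpt.keys()))      # training-style checkpoints usually contain several keys
print("iteration" in ckpt)    # the repo's load_checkpoint appears to look for this key
```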
Yeah, upon further investigation I realized it has something to do with what you said. Should I simply manipulate the keys inside the hifigan checkpoint? I would rather not do that, since it would become impossible for me to debug given how many things I've already touched to configure it for 24 kHz. I hope the author releases their 24 kHz config and vocoder if they have one.
I looked at the keys: the author's vocoder has these:
while mine has just a single one.
Hello, in the train code, loading the vocoder is simply for checking the audio through validation. If you trained this model with the same Mel settings as your desired 24 kHz vocoder, you can use it for validation and inference. It seems that parts such as the optimizer and iteration have been excluded, but you can load the pre-trained vocoder in the following way, so please try it:

```python
from vocoder.models import Generator

model = Generator(hps).cuda()
state_dict = load_checkpoint("your vocoder path", device="cuda")
model.load_state_dict(state_dict['generator'])
model.eval()
model.remove_weight_norm()
```
Sure, thanks. I'll try that.
After days of trying, and after redoing the pre-processing a few times (even at 16 kHz), I always end up with the same error. Well guys, time to give up.
Hello, @SoshyHayami. That is a NumPy error, and it seems to be related to an invalid range passed to `randint`. It appears the error occurs at this part of the code (lines 50 to 52 in 9c1e57c):
In this code, the value divided by 80 is used as the starting value for F0, because we use F0 at a resolution 4 times higher than that of the Mel scale. This adjustment ensures that the segment sizes match. For the Mel scale, we use a hop size of 320. Please adjust this part of the code accordingly. Thanks.
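For illustration only, here is a rough sketch of the relationship described above; this is not the repo's exact code, and the segment length and clip length are made-up values:

```python
import random

hop_mel = 320      # Mel hop size used in this repo
hop_f0 = 80        # F0 hop size; 4x the Mel resolution, as described above
seg_frames = 35    # hypothetical Mel segment length in frames

audio_len = 160000                            # example clip: 10 s at 16 kHz
max_start = audio_len - seg_frames * hop_mel
if max_start <= 0:
    # roughly the situation that produces the invalid-range randint error:
    # the clip is shorter than one training segment
    raise ValueError("clip shorter than one training segment")

start = random.randint(0, max_start)   # random offset in audio samples
mel_start = start // hop_mel           # corresponding start index in Mel frames
f0_start = start // hop_f0             # start index in F0 frames (about mel_start * 4)
```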
Hi, I understand it's because of the invalid range given to randint. I used the exact same pre-processing steps on a 16 kHz audio dataset, aggressively cutting the silent parts, but I still get the error. By the way, are the numbers 80 and 1280 related to the number of mel bins and the window size, respectively? The vocoder I was going to use for the 24 kHz version uses this config:
So I assumed all instances of 1280, whether in the utils or the extraction scripts, must be changed to 1200. (I should re-emphasize that the main issue I described above regarding the F0 calculations happens even with the unmodified code on a 16 kHz dataset, using the checkpoints you provided, so what I'm saying here is unrelated to that.)
@SoshyHayami, for what it's worth, I trained the model at 16 kHz on 500-700 thousand wav files using Crepe for the pitch embeddings and had no issues (with no attention paid to silence, and a very diverse dataset). I'm currently trying to do a 24 kHz model, but I'm still fighting to get everything aligned (I'm trying to do it without extracting new Crepe pitch embeddings, as that would take me almost 2 days even with two 3090s on it). What did you do with wav2vec? Did you just keep the same model and give it 16 kHz input?

@hayeong0 Thanks for sharing this awesome model, by the way! I've been so disappointed with all of the publicly available voice conversion diffusion models. They pretty much all use the same core strategy and code, which invariably means they train on source voice to source voice, and regardless of the other innovations and tricks, the result is always poor. Your Mixin strategy is brilliant! I don't fully understand why it works so well, or why it seems to take far more positive style transfer than negative content transfer from it, but it does work very well! I have not had a chance to experiment much with my results yet, but from a couple of experiments, zero-shot with over 4,000 speakers looked pretty darn impressive! It also nailed a couple of unique voices in the training data that always elude these other models.
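One hypothetical way to handle the wav2vec question when moving to 24 kHz is to keep the pretrained 16 kHz content encoder and simply resample before it; the file name below is a placeholder:

```python
import torchaudio

# wav2vec 2.0 / XLS-R models expect 16 kHz input, so 24 kHz training audio
# would be resampled before the content encoder. Placeholder file name.
wav_24k, sr = torchaudio.load("example_24khz.wav")                  # sr == 24000
resampler = torchaudio.transforms.Resample(orig_freq=sr, new_freq=16000)
wav_16k = resampler(wav_24k)                                        # feed this to wav2vec
```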
I finally finished my 24 kHz version of this model. I ended up finetuning a BigVGAN-V2 model for the vocoder. I also trained a VQGAN to map the mel-spectrograms to a latent space for the diffusion part (for performance, and to fit in all the additional attention), and I replaced the diffusion model's UNet with one from the Diffusers library, changing the loss calculation and reverse diffusion to work with it (so you can use any Diffusers scheduler). I configured the UNets essentially as they are for Stable Diffusion SDXL, except that I used three downsample layers instead of four and not quite as many attention heads. For the hell of it, I also added a bunch of multi-paired conversions generated by a commercial voice conversion model to the data (mixing is disabled when training hits them). Then I trained on about 500,000 wav files. The results are pretty darn good.
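As a rough sketch of the kind of UNet swap described above (the channel counts, attention dimensions, and latent shape here are illustrative guesses, not the commenter's actual configuration):

```python
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel(
    sample_size=64,             # VQGAN latent resolution, assumed
    in_channels=4,              # VQGAN latent channels, assumed
    out_channels=4,
    down_block_types=(          # three downsampling stages, as mentioned above
        "DownBlock2D",
        "CrossAttnDownBlock2D",
        "CrossAttnDownBlock2D",
    ),
    up_block_types=(
        "CrossAttnUpBlock2D",
        "CrossAttnUpBlock2D",
        "UpBlock2D",
    ),
    block_out_channels=(320, 640, 1280),
    cross_attention_dim=768,    # width of the content/style conditioning, assumed
    attention_head_dim=8,       # fewer attention heads than stock SDXL, assumed
)
```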
Have you tried using the author's pre-trained model to finetune on a single person's voice data? How were the results? I used single-person audio from the web for finetuning, but the results were poor, with problems such as electronic-sounding artifacts.
No, I never tried to finetune the original code. I did train the original from scratch on a custom dataset, and it came out very well, though.
Although, I should say, it required very clean audio files for good results. That bothered me, so I changed it to train 20% of the time on audio that has been deteriorated and/or overlapped with background noise. That has made it much less finicky about the input audio.
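A sketch of that kind of 20%-of-the-time corruption might look like the following; the specific degradations and their parameters are assumptions, not the commenter's actual pipeline:

```python
import random
from typing import Optional

import numpy as np

def degrade(wav: np.ndarray, noise: Optional[np.ndarray] = None, p: float = 0.2) -> np.ndarray:
    """Randomly deteriorate a mono waveform (range -1..1) with probability p."""
    if random.random() >= p:
        return wav
    out = wav.copy()
    # a few short dropouts (zeroed spans)
    for _ in range(random.randint(1, 3)):
        start = random.randint(0, max(0, len(out) - 400))
        out[start:start + random.randint(40, 400)] = 0.0
    # mild clipping distortion
    gain = random.uniform(1.5, 3.0)
    out = np.clip(out * gain, -1.0, 1.0) / gain
    # optional low-level background overlap
    if noise is not None:
        n = min(len(out), len(noise))
        out[:n] += 0.1 * noise[:n]
    return np.clip(out, -1.0, 1.0)
```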
Thank you for sharing your experience! 😆🔥 When the target audio for conversion contains background noise or is noisy, the synthesized (converted) audio may also include noise. This happens because the style encoder extracts the style using average pooling on Mel-spectrograms, which doesn’t account for noise separately. As you mentioned, using clean audio for training and inference leads to better results. Approaches that use noisy data as training data are also a nice idea! To address the noise in the current model structure, I found that increasing the encoder size and extending the training segment length were helpful. I’ve conducted follow-up research related to VC, improving audio quality and speaker similarity. I plan to upload the model by December!
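To make the average-pooling point concrete, here is a toy sketch; the shapes and layer sizes are assumptions, not the actual style encoder:

```python
import torch

mel = torch.randn(1, 80, 400)                                   # (batch, mel_bins, frames)
encoder = torch.nn.Conv1d(80, 256, kernel_size=5, padding=2)    # stand-in for the encoder layers
style = encoder(mel).mean(dim=-1)                               # average pooling over time -> (1, 256)
# any noise present in the reference Mel frames is averaged into this style vector
```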
Thanks! I'll definitely keep my eye out for anything regarding better speaker similarity! I wasn't having so much of a problem with noise; it was more degradation. Windows 11 (or its drivers) is just terrible when it comes to getting clean audio from an audio interface. I have two Focusrite interfaces and they are both almost always problematic. So you get these little micro-dropouts or distortions, and to the human ear they are not that bad; you can tell the audio is not crystal clear, but for a quick home recording you wouldn't think much of it. But those spots really trip up the voice conversion, as if the issue were multiplied dramatically: it doesn't fail by sounding similar to how it should but a bit off, it fails hard. It's the same for others, like Sovits and RVC. So I just tried to simulate that kind of damage 20% of the time, and that largely solved it. I tossed in other stuff as well, like RIR room noise and background sounds, just for fun, but I don't know that that was very useful; I only compared against a bunch of audio I was having issues with regarding that specific unclean audio. The model using VQ-GAN latents is even more robust against it.
@hayeong0 Is this algorithm not suitable for fine-tuning on a small amount of single-person data?
It probably doesn't work great for that. The training mixes voices: it pairs the speaker embedding with content and pitch from other speakers. If you finetune on one voice, you don't get any mixing with other embeddings. It should work better than nothing, but it's certainly not ideal.
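A rough sketch of the mixing idea (not the repo's exact code): within a batch, each utterance's content and pitch get paired with a speaker embedding shuffled from another utterance, which is exactly what a single-speaker finetuning set cannot provide.

```python
import torch

def shuffle_speaker_embeddings(spk_emb: torch.Tensor) -> torch.Tensor:
    """Pair each item in the batch with another item's speaker embedding."""
    perm = torch.randperm(spk_emb.size(0))
    return spk_emb[perm]
```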
Hi, I was facing a similar issue, but then I checked the filelists train_wav.txt and train_f0.txt: the number of file paths (and their order) did not match between the two. There is no logic in the code to match filenames, so you have to ensure this yourself when creating the filelists. I was able to run the training code after doing so.
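A quick sanity check along those lines (the filelist paths are assumptions; adjust to wherever your lists live):

```python
# Verify that train_wav.txt and train_f0.txt have the same length and ordering.
with open("filelists/train_wav.txt") as fw, open("filelists/train_f0.txt") as ff:
    wavs = [line.strip() for line in fw if line.strip()]
    f0s = [line.strip() for line in ff if line.strip()]

assert len(wavs) == len(f0s), f"count mismatch: {len(wavs)} wav paths vs {len(f0s)} F0 paths"
for w, f in zip(wavs, f0s):
    stem_w = w.rsplit("/", 1)[-1].split(".")[0]
    stem_f = f.rsplit("/", 1)[-1].split(".")[0]
    assert stem_w in stem_f or stem_f in stem_w, f"possible order mismatch: {w} vs {f}"
```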
I'm getting this error; it sounds like your code throws an error if I don't feed it your base model, is that right?
The thing is, I don't want to fine-tune your base model. I need to train it from scratch.
There are also a couple of mistakes in your code. For example,
`from hifigan.vocoder import HiFi`
is an invalid path and should be `from vocoder.hifigan import HiFi`,
plus the path to your hifigan, dependencies, etc. So far, I've not been able to start a training session, but I hope it works with the regular HiFi-GAN checkpoints provided by its authors.