Adding BigVGAN as Vocoder #14
It depends on your BigVGAN model. I'm a little curious how you got your BigVGAN, as there is no official pretrained model released yet. If it was not trained to match the data preprocessing in meldataset.py, it will not work. You will need to do some rescaling and interpolation to match the preprocessing there (if the sampling rate is the same), or you will need to retrain your StyleTTS or BigVGAN to use the same preprocessing. When there is an official pretrained model, I will test the quality and see whether it is worth changing all my repos to match that preprocessing for 24 kHz.
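A rough sketch of what that rescaling and interpolation could look like, purely as an illustration; the shapes, hop lengths, and mel counts below are example values, not the actual settings of either repo:

```python
import torch
import torch.nn.functional as F

# Example: a mel computed with one preprocessing (80 mels, hop 300) that a
# vocoder expecting a different preprocessing (100 mels, hop 256) should consume.
mel = torch.randn(1, 80, 200)          # (batch, n_mels, frames)
src_hop, dst_hop = 300, 256            # hypothetical source/target hop lengths
dst_n_mels = 100

# Keep the audio duration roughly constant: frames * hop_length should match.
dst_frames = int(mel.size(-1) * src_hop / dst_hop)

# Bilinear resize over the (n_mels, frames) plane.
mel_resized = F.interpolate(
    mel.unsqueeze(1),                  # (B, 1, n_mels, frames)
    size=(dst_n_mels, dst_frames),
    mode="bilinear", align_corners=False,
).squeeze(1)                           # (B, dst_n_mels, dst_frames)
```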
I'm currently training a model (on vast.ai, with three 3090s) based on the GitHub instructions, using the lion-pytorch optimizer in place of AdamW. If you want a download of the generator, let me know and I'll get you a link to test. I changed num_mels to 80 to match StyleTTS and left the rest matching the BigVGAN config. Could that be the issue?
It is not about the configuration. You need to match the preprocessing here as well. See yl4579/StarGANv2-VC#59 (comment)
Since you are training BigVGAN, you would also have to set n_fft=2048, win_length=1200, hop_length=300 in your BigVGAN config to match the parameters of StyleTTS.
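For reference, these values correspond to a torchaudio transform like the one below (a sketch of the StyleTTS-side preprocessing; the matching key names in BigVGAN's own JSON config, e.g. hop_size/win_size, should be checked against that repo):

```python
import torchaudio

# The STFT/mel settings referenced above: 80 mel bins, 2048-point FFT,
# 50 ms window and 12.5 ms hop at 24 kHz.
to_mel = torchaudio.transforms.MelSpectrogram(
    n_mels=80, n_fft=2048, win_length=1200, hop_length=300)
```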
Oh okay, gotcha, so basically retrain the preprocessing to match the same config, or just match the settings. Sorry, I'm just a tinkerer realistically :P Edit: I think I read somewhere that the hop length has to be 256 for BigVGAN for some reason. So I could go either way: train BigVGAN with the StyleTTS settings, or retrain the preprocessors to match the BigVGAN settings?
To add on to @lexkoro's point, you also need to change how the mel spectrogram is computed. In HiFi-GAN and BigVGAN, the mel spectrograms are generated using librosa, while in all of my repos they are generated with torchaudio. There is a slight difference in how the two libraries compute the mel spectrogram, but all of this can be fixed by doing a reverse mel and recomputing the mel scale with the other library. If I have extra time, I will change all of my repos to match BigVGAN's preprocessing (provided their pretrained models have better quality than the vocoder I use now).
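A minimal sketch of that "reverse mel" idea, assuming the usual library defaults (torchaudio's MelSpectrogram uses the HTK mel scale with no filterbank normalization, while librosa.filters.mel defaults to the Slaney scale with Slaney normalization). Parameter values are illustrative, and this operates on linear-amplitude mels, before any log compression:

```python
import torch
import librosa
import torchaudio

sr, n_fft, n_mels = 24000, 2048, 80

# torchaudio-style mel transform (HTK mel scale by default).
to_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sr, n_fft=n_fft, win_length=1200, hop_length=300, n_mels=n_mels)

# librosa-style filterbank (Slaney scale and normalization by default).
librosa_fb = torch.from_numpy(
    librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)).float()  # (n_mels, n_freqs)

def torchaudio_to_librosa_mel(mel: torch.Tensor) -> torch.Tensor:
    """mel: (n_mels, frames) from torchaudio -> approximate librosa-scale mel."""
    fb = to_mel.mel_scale.fb                 # (n_freqs, n_mels), HTK filterbank
    linear = torch.linalg.pinv(fb).T @ mel   # pseudo-inverse undoes the mel projection
    linear = linear.clamp(min=0.0)           # magnitudes are non-negative
    return librosa_fb @ linear               # re-project with the librosa filterbank
```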
Gotcha, thanks for the help. Yeah, I was super curious because of their examples to see how it would work with your repo. Here is the BigVGAN generator trained up to 100,000 steps, with pretty great quality IMO, though I haven't heard plain HiFi-GAN: https://drive.google.com/file/d/1VRUq3hjjXloTQ-gcYMRDsVpBqd_GXPrN/view?usp=sharing config:
You can either retrain the text aligner, pitch extractor, and StyleTTS altogether with the BigVGAN settings, or you can retrain BigVGAN with the StyleTTS settings, whichever way works faster for you. If you do the former, it also saves me time, because you can directly tell me how well it works with BigVGAN and I can create a new branch with all of the pretrained models you have there.
Sounds good. I might train everything; I'll let you know!
Should I convert from torchaudio to librosa in AuxiliaryASR and PitchExtractor, or just leave them with torchaudio?
You should follow exactly the way BigVGAN processes the mel spectrogram. You probably need to replace the meldataset class in this file entirely with this: https://github.com/NVIDIA/BigVGAN/blob/main/meldataset.py, except you also return the text labels.
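A hypothetical sketch of that modification. It assumes BigVGAN's meldataset.py keeps the HiFi-GAN-style MelDataset whose items are (mel, audio, filename, mel_loss); the text_labels mapping is an invented placeholder for however the labels are actually stored:

```python
from meldataset import MelDataset  # BigVGAN's meldataset.py, dropped into this repo

class MelTextDataset(MelDataset):
    """BigVGAN preprocessing, but each item also carries its text label."""

    def __init__(self, *args, text_labels=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.text_labels = text_labels or {}   # filename -> text / token ids

    def __getitem__(self, index):
        mel, audio, filename, mel_loss = super().__getitem__(index)
        return mel, audio, filename, mel_loss, self.text_labels.get(filename)
```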
They have released their pretrained models at 22 kHz and they sound quite good. I will try to retrain all the models to match BigVGAN's settings and create a new branch for all of my repos.
Lol, didn't expect them to release models so quickly. I might keep trying to convert the scripts too, but it might be a little hard for me; I'll post if there's any progress. That's awesome, can't wait! Good luck! I wonder if it would be worth adding CLAP at some point to classify/direct the style or quality of the audio somehow: https://github.com/LAION-AI/CLAP. I have no idea if this is even worth mentioning 🤷♂️
Maybe you are referring to something like this: http://dongchaoyang.top/InstructTTS? Note that it would probably work for StyleTTS too if you have the right data, but their data isn't public, and CLAP isn't suitable for text-instructed emotion control, as CLAP is about audio in general rather than speech.
Gotcha. Well, thanks for all the help and information; wish I could help out more. Much appreciated.
Hello @yl4579
@arampacha I'm very busy with my other projects, so it might be difficult to find the Tensorboard logs for the LibriTTS dataset now. But could you please tell me whether you were able to reproduce the results with the BigVGAN recipe on LJSpeech, or to train a model with the original recipe on LibriTTS? I just want to make sure there's nothing wrong with the code I uploaded. Unfortunately, I only tested this repo on LJSpeech; the LibriTTS model was trained with uncleaned experiment code, so there might be some differences I didn't realize. I'd also like to make sure the code you modified for the BigVGAN recipe has no problems. I was able to train a model on LJSpeech with the BigVGAN recipe with similar quality using the code from this repo, but I haven't tried it on LibriTTS, so I didn't update the repo with it. You can refer to this branch with the pretrained text aligner and pitch extractor for how to modify the references. Once I finish the paper on the E2E version of StyleTTS, I will update this repo with the BigVGAN recipe.
Thanks for your response!
Hey, I'm trying to add my BigVGAN vocoder model to the inference script, but when it generates audio it always has a lot of noise, compared to the inference script of the original BigVGAN code base (https://github.com/NVIDIA/BigVGAN). Any ideas on why that could be? It looks to be the same setup as HiFi-GAN. If you would like one of my trained models, let me know and I'll give you a download link so you can test with it, as there are currently no available models.
Thanks in advance!
Also tried the original, with the same result: grainy/noisy audio.
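For context, a sketch of what "the same setup as HiFi-GAN" usually means when loading such a generator for inference; the class name, env.AttrDict wrapper, and checkpoint layout are assumptions based on the HiFi-GAN convention, not verified against this exact repo:

```python
import json
import torch
from env import AttrDict        # HiFi-GAN-style config wrapper (assumed)
from models import BigVGAN      # generator class name assumed

with open("config.json") as f:
    h = AttrDict(json.load(f))

generator = BigVGAN(h)
state = torch.load("g_00100000", map_location="cpu")
generator.load_state_dict(state["generator"])   # HiFi-GAN checkpoints store {"generator": ...}
generator.eval()
generator.remove_weight_norm()

with torch.no_grad():
    mel = torch.randn(1, h.num_mels, 200)       # placeholder mel input
    audio = generator(mel)                      # (1, 1, samples)
```

If the mel fed in here was produced with a different STFT/mel-scale preprocessing than the one the generator was trained on, noisy output like the above is exactly what you would expect.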