Any-to-many VC: how to improve speech intelligibility for arbitrary inputs? #51
Comments
I believe it could simply be because there's not enough training data. Any-to-many conversion requires a lot of input data for the model to generalize well.
And to generalize well, do we need to have multiple discriminators (e.g. one per 10 speakers), as discussed in another topic?
I have already used 200 speakers, each with around 15~20 minutes of audio...
@skol101 I don't believe so; if it's not for any-to-any, you only need a lot of input speakers. You do not need the cycle loss in this case, because you don't really need that many output speakers. One thing you can do is modify the cycle loss to match the encoder output instead of the decoder output (i.e., the same speech should have the same encoded representation before and after conversion). @Kristopher-Chen You say the intelligibility sometimes drops, so which cases drop and which cases are good?
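To make that suggestion concrete, here is a minimal sketch of a cycle loss matched at the encoder output, assuming the generator can be split into separate `encode`/`decode` steps; these are placeholder helpers, not the repo's actual interface, since the real generator interleaves encoding, F0 conditioning, and decoding.

```python
import torch.nn.functional as F

def encoder_cycle_loss(generator, mel_real, style_target):
    """Cycle loss matched at the encoder output instead of the decoder output:
    the same speech should keep the same encoded (content) representation
    before and after conversion. `generator.encode`/`generator.decode` are
    assumed helpers, not the repo's actual API."""
    content_real = generator.encode(mel_real)                 # content of the source
    mel_fake = generator.decode(content_real, style_target)   # converted to the target speaker
    content_cycled = generator.encode(mel_fake)               # re-encode the converted speech
    return F.l1_loss(content_cycled, content_real)
```

This would replace (or sit alongside) the usual mel-level cycle term, under the assumptions above.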
@yl4579 cheers! How about the shared projection as per #6 (comment)? Is it applicable to any-to-many conversion?
@skol101 I don't think you need this either; it is to make the style encoder speaker-independent so you can convert to any output speaker. If you are only interested in any-to-many, this is not necessary.
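For anyone curious what that looks like in practice, here is a rough sketch of a shared projection head: the per-speaker linear heads of the style encoder are replaced by a single projection, so the style vector no longer depends on a speaker index. The class name and dimensions below are placeholders, not the repo's or #6's exact code.

```python
import torch.nn as nn

class SharedStyleHead(nn.Module):
    """Single shared projection in place of per-speaker linear heads, so the
    style encoder output is speaker-independent. Dimensions are placeholders."""

    def __init__(self, feat_dim=512, style_dim=64):
        super().__init__()
        self.proj = nn.Linear(feat_dim, style_dim)

    def forward(self, shared_features, speaker_index=None):
        # The speaker index is ignored: every speaker goes through the same head.
        return self.proj(shared_features)
```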
@yl4579 https://drive.google.com/drive/folders/1SGBJllEvWg9a70qJf5DZhTVT_E5bl-w0
I have tried an any-to-many mapping based solely on this amazing project, and it works well for some speakers but not all. I used 10 speakers with 20 minutes of audio per speaker, and the hyperparameters are the same as the original. At epoch 248, two speakers work excellently; both of them can convert Sichuan-dialect Chinese even though they were trained on Mandarin, and both can handle the any-to-many conversion task. At epoch 466, I get 5 speakers who work perfectly, and the conversion quality for all speakers has improved a lot. From my experience, you can keep training and wait. The training data is vitally important for this task; higher data quality tends to lead to better speaker performance. However, quality alone can't guarantee better performance, since I use both Lijian Zhao and Chunying Hua as speakers, and Chunying Hua works well at epoch 248 while Lijian Zhao does not.
@thsno02 what vocoder have you used?
@skol101 The original one, and I use the mapping network rather than the style encoder.
Interesting, it was reported elsewhere that the style encoder is better at VC than the mapping network. Also, you haven't fine-tuned the vocoder on your dataset?
I haven't tried any fine-tuning due to the time frame. I did a lot of experiments on model performance; my conclusion is that the mapping network tends to perform better in the any-to-many task than the style encoder, while the style encoder sometimes converts audio with more linguistic information and more fluency. Meanwhile, in my scenario, neither the mapping network nor the style encoder can convert the audio with consistently high quality. This phenomenon kills me, and I have not figured it out. There are many potential reasons for this.
Tip: I have trained for 742 epochs, but the model's generalization does not change and I still only get 2 usable speakers.
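To make the mapping-network-vs-style-encoder distinction concrete, the difference at inference time is only where the style vector comes from. A sketch is below; the names and signatures follow the StarGANv2 convention and should be treated as assumptions rather than the exact repo API.

```python
import torch

def get_style(mapping_network, style_encoder, y_trg, ref_mel=None, latent_dim=16):
    """Two ways to obtain a style vector for target speaker index `y_trg`:
    - mapping network: sample a random latent code, no reference audio needed;
    - style encoder: extract the style from a reference mel of the target speaker.
    Signatures follow the StarGANv2 convention; this is a sketch, not the exact API."""
    if ref_mel is None:
        z = torch.randn(1, latent_dim)       # random latent code
        return mapping_network(z, y_trg)     # style from the mapping network
    return style_encoder(ref_mel, y_trg)     # style from a reference utterance
```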
Have you trained them (mapping network and style encoder) both together or separately?
Both.
@Kristopher-Chen |
So, do I need to compute the loss on the encoder output before adding F0?
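One reading of the earlier suggestion: the consistency is enforced on the encoder output itself, before the F0 features are concatenated for decoding, since the F0 contour is speaker-dependent and is not expected to match after conversion. A hedged sketch of that ordering follows; the explicit `encode`/`decode` split and the single concatenation point are simplifications of the actual generator, and `f0_model` returning F0 features directly is an assumption.

```python
import torch
import torch.nn.functional as F

def encoder_loss_before_f0(generator, f0_model, mel_real, style_target):
    """Match encoder outputs *before* F0 features are attached. Only the
    F0-free content code is compared, because the F0 contour changes with
    the target speaker. `generator.encode`/`decode` are assumed helpers."""
    content_real = generator.encode(mel_real)            # no F0 attached yet

    f0_feat = f0_model(mel_real)                         # F0 features of the source
    mel_fake = generator.decode(
        torch.cat([content_real, f0_feat], dim=1),       # F0 is added only for decoding
        style_target,
    )

    content_fake = generator.encode(mel_fake)            # again without F0
    return F.l1_loss(content_fake, content_real)
```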
When testing arbitrary inputs in any-to-many VC cases, the speech intelligibility sometimes drops: some phonemes are not pronounced well or sound blurred. It seems there are no other explicit constraints on this except for the ASR (or more specifically, PPG) loss. Any ideas to improve this?
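For reference, the ASR/PPG constraint mentioned here is a feature-matching loss between the phonetic features of the source and the converted speech, roughly along the lines sketched below. The `asr_model.get_feature` name is an assumption about the interface, and raising the loss weight or matching features from more than one ASR layer are only ideas, not tested fixes.

```python
import torch
import torch.nn.functional as F

def asr_consistency_loss(asr_model, mel_real, mel_fake, weight=1.0):
    """PPG/ASR feature-matching loss: the converted speech should keep the
    phonetic content of the source. `asr_model.get_feature` is assumed to
    return frame-level phonetic (PPG-like) features for a mel input."""
    with torch.no_grad():
        feat_real = asr_model.get_feature(mel_real)   # source PPG features (fixed target)
    feat_fake = asr_model.get_feature(mel_fake)       # PPGs of the converted speech
    return weight * F.smooth_l1_loss(feat_fake, feat_real)
```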