
Any-to-many VC: how to improve speech intelligibility for arbitrary inputs? #51

Open
Kristopher-Chen opened this issue Jun 10, 2022 · 16 comments
Labels
discussion New research topic

Comments

@Kristopher-Chen

When testing arbitrary inputs in any-to-many VC, the speech intelligibility sometimes drops: some phonemes are not pronounced well or sound blurred. It seems there are no explicit constraints on this other than the ASR (or more specifically, PPG) loss. Any ideas for improving this?

@yl4579
Owner

yl4579 commented Jun 14, 2022

I believe it could simply be because there's not enough training data. Any-to-many conversion requires a lot of input data for the model to generalize well.

@skol101

skol101 commented Jun 14, 2022

And to generalise well, do we need multiple discriminators (e.g. one per 10 speakers), as discussed in another topic?

@Kristopher-Chen
Author

> I believe it could simply be because there's not enough training data. Any-to-many conversion requires a lot of input data for the model to generalize well.

I have already used 200 speakers, each with around 15–20 minutes of audio...

@yl4579
Owner

yl4579 commented Jun 14, 2022

@skol101 I don't believe so; if it's not for any-to-any, you only need a lot of input speakers. You do not need the cycle loss in this case, because you don't really need that many output speakers. One thing you can do is modify the cycle loss to match the encoder output instead of the decoder output (i.e., the same speech should have the same encoded representation before and after conversion).
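
A minimal sketch of that encoder-consistency idea, assuming a content encoder `F_enc` and a decoder `G` that takes a target style vector (names and signatures are illustrative, not the repository's exact API):

```python
import torch.nn.functional as nnF

def encoder_cycle_loss(F_enc, G, x_real, style_trg):
    """Same utterance should map to the same encoding before and after conversion."""
    h_real = F_enc(x_real)               # encode the source speech
    x_fake = G(h_real, style_trg)        # decode/convert to the target speaker
    h_fake = F_enc(x_fake)               # re-encode the converted speech
    return nnF.l1_loss(h_fake, h_real)   # penalize drift in the content representation
```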

@Kristopher-Chen You say it sometimes drops, so in which cases does it drop and in which cases is it good?

@skol101

skol101 commented Jun 14, 2022

@yl4579 cheers! How about shared projection as per #6 (comment) ? Is it applicable to any-to-many conversion?

@yl4579
Owner

yl4579 commented Jun 15, 2022

@skol101 I don't think you need this either; it is meant to make the style encoder speaker-independent so you can convert to any output speaker. If you are only interested in any-to-many, it is not necessary.
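
For reference, a rough sketch of the difference, assuming a style encoder with a shared backbone followed by projection heads (dimensions and class names are made up for illustration):

```python
import torch.nn as nn

class StyleHeads(nn.Module):
    """Projection head(s) placed on top of a shared style-encoder backbone."""
    def __init__(self, feat_dim=512, style_dim=64, num_domains=20, shared=False):
        super().__init__()
        self.shared = shared
        if shared:
            # one projection reused by every speaker -> speaker-independent style space
            self.proj = nn.Linear(feat_dim, style_dim)
        else:
            # one projection per target speaker -> sufficient for any-to-many
            self.proj = nn.ModuleList(
                [nn.Linear(feat_dim, style_dim) for _ in range(num_domains)]
            )

    def forward(self, feat, domain_idx):
        return self.proj(feat) if self.shared else self.proj[domain_idx](feat)
```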

@Kristopher-Chen
Author

Kristopher-Chen commented Jun 15, 2022

> @skol101 I don't believe so; if it's not for any-to-any, you only need a lot of input speakers. You do not need the cycle loss in this case, because you don't really need that many output speakers. One thing you can do is modify the cycle loss to match the encoder output instead of the decoder output (i.e., the same speech should have the same encoded representation before and after conversion).
>
> @Kristopher-Chen You say it sometimes drops, so in which cases does it drop and in which cases is it good?

@yl4579 https://drive.google.com/drive/folders/1SGBJllEvWg9a70qJf5DZhTVT_E5bl-w0
I trained with 200 Chinese speakers; there is an example here. The input was recorded on a PC, and I tried converting it to one male and one female speaker using checkpoints at 50 and 120 epochs. The points are:

  1. the male output sounds noisier than the female output;
  2. by “drops” I mean the speech intelligibility gets worse, as in the 120-epoch outputs. The interesting thing is that the 50-epoch results seem better than the 120-epoch ones; I just could not figure it out.

@thsno02

thsno02 commented Jul 6, 2022

I have tried an any-to-many mapping based solely on this amazing project, and it works well for some speakers but not all. I used 10 speakers with 20 minutes of audio per speaker, and the hyperparameters are the same as the original.

At epoch 248, two speakers work excellently: both can convert Sichuan-dialect Chinese even though they are trained on Mandarin, and both can handle the any-to-many conversion task.

At epoch 466, I get 5 speakers that work perfectly, and the conversion quality for all speakers has improved a lot.

From my experience, you can keep training and wait. The training data is vitally important for this task: higher data quality tends to yield better performance for the speakers. However, quality alone can't guarantee better performance, since I use both Lijian Zhao and Chunying Hua as speakers, and Chunying Hua works well at epoch 248 while Lijian Zhao does not.

@skol101

skol101 commented Jul 20, 2022

@thsno02 what vocoder have you used?

@thsno02

thsno02 commented Jul 21, 2022

@skol101 The original one, and I use the mapping network rather than the style encoder.
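
For context, a rough sketch of the two ways a style vector can be produced at conversion time in a StarGANv2-VC-style model (call signatures are illustrative; the repository's modules may differ):

```python
import torch

@torch.no_grad()
def style_from_mapping(mapping_network, domain_idx, latent_dim=16):
    # mapping network: sample a style from random noise for the chosen target speaker
    z = torch.randn(1, latent_dim)
    y = torch.LongTensor([domain_idx])
    return mapping_network(z, y)

@torch.no_grad()
def style_from_reference(style_encoder, ref_mel, domain_idx):
    # style encoder: extract the style from a reference utterance of the target speaker
    y = torch.LongTensor([domain_idx])
    return style_encoder(ref_mel, y)
```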

@skol101

skol101 commented Jul 21, 2022

Interesting; it was reported elsewhere that the style encoder is better at VC than the mapping network.

Also, you haven't fine-tuned the vocoder on your dataset?

@thsno02

thsno02 commented Jul 22, 2022

I haven’t tried any fine-tuning due to the time frame. I did a lot of experiments on model performance; my conclusion is that the mapping network tends to perform better on the any-to-many task than the style network, while the style network sometimes converts audio with more linguistic information and more fluently. Meanwhile, in my scenario, neither the mapping network nor the style network converts audio at high quality consistently. This phenomenon kills me, and I have not figured it out.

There are many potential reasons for this:

  • I use a Chinese corpus instead of English to train the model; maybe fine-tuning will help;
  • the model is sensitive to volume, since in my model lower-volume input tends to yield bad conversion (see the normalization sketch after this list);
  • the quality of the captured audio varies across different microphones;
  • the quality of my training data is low.
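
On the volume point, a minimal preprocessing sketch that peak-normalizes each clip before feature extraction (librosa and soundfile are assumed to be installed; the 24 kHz sample rate is only an example):

```python
import librosa
import numpy as np
import soundfile as sf

def peak_normalize(in_path, out_path, target_peak=0.95, sr=24000):
    """Rescale a clip so its peak amplitude equals target_peak."""
    wav, _ = librosa.load(in_path, sr=sr)
    peak = np.max(np.abs(wav)) + 1e-8     # avoid division by zero on silent clips
    sf.write(out_path, wav * (target_peak / peak), sr)
```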

Tip: I have trained for 742 epochs, but the model's generalization does not change and I still only get 2 usable speakers.

@skol101

skol101 commented Jul 22, 2022

Have you trained them (mapping and style) both or separately?

@thsno02

thsno02 commented Jul 23, 2022

both

yl4579 added the discussion (New research topic) label on Sep 18, 2022
@1nlplearner

1nlplearner commented Feb 16, 2023

@Kristopher-Chen
How many domains are in your discriminator, and how many discriminators do you use?

@1nlplearner

> @skol101 I don't believe so; if it's not for any-to-any, you only need a lot of input speakers. You do not need the cycle loss in this case, because you don't really need that many output speakers. One thing you can do is modify the cycle loss to match the encoder output instead of the decoder output (i.e., the same speech should have the same encoded representation before and after conversion).
>
> @Kristopher-Chen You say it sometimes drops, so in which cases does it drop and in which cases is it good?

So, do I need to compute the loss on the encoder output before F0 is added?
What is the function of F0?
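
In StarGANv2-VC-style models, F0 is the pitch curve from a pretrained extractor; it conditions the decoder so the converted speech keeps the source intonation pattern, and an F0 consistency loss is usually computed on mean-normalized pitch so the contour shape is preserved while the absolute pitch can follow the target speaker. A hedged sketch of such a loss (the extractor name and exact normalization are illustrative):

```python
import torch.nn.functional as nnF

def f0_consistency_loss(f0_net, x_real, x_fake, eps=1e-8):
    """Compare mean-normalized pitch contours of the source and converted speech."""
    f0_real = f0_net(x_real)   # per-frame F0 of the source
    f0_fake = f0_net(x_fake)   # per-frame F0 of the conversion
    f0_real = f0_real / (f0_real.mean(dim=-1, keepdim=True) + eps)
    f0_fake = f0_fake / (f0_fake.mean(dim=-1, keepdim=True) + eps)
    return nnF.smooth_l1_loss(f0_fake, f0_real)
```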
