
Any-to-many VC: how to improve speech intelligibility for arbitrary inputs? #51

Open
Kristopher-Chen opened this issue Jun 10, 2022 · 16 comments
Labels
discussion New research topic

Comments

@Kristopher-Chen

When testing arbitrary inputs in any-to-many VC, the speech intelligibility sometimes drops: some phonemes are not pronounced well or sound blurred. It seems there are no explicit constraints on this other than the ASR (or more specifically, PPG) loss. Any ideas for improving this?

@yl4579
Owner

yl4579 commented Jun 14, 2022

I believe it could simply be because there's not enough training data. Any-to-many conversion requires a lot of input data for the model to generalize well.

@skol101

skol101 commented Jun 14, 2022

And to generalise well, do we need multiple discriminators (e.g. one per 10 speakers), as discussed in another topic?

@Kristopher-Chen
Author

> I believe it could simply be because there's not enough training data. Any-to-many conversion requires a lot of input data for the model to generalize well.

I have already used 200 speakers, each with around 15–20 minutes of audio...

@yl4579
Owner

yl4579 commented Jun 14, 2022

@skol101 I don't believe so; if it's not for any-to-any, you only need a lot of input speakers. You do not need the cycle loss in this case, because you don't really need that many output speakers. One thing you can do is modify the cycle loss to match the encoder output instead of the decoder output (i.e., the same speech should have the same encoded representation before and after conversion).
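
A minimal sketch of that encoder-consistency idea, assuming a content encoder `F_enc` and a decoder `G` that takes a target style vector (names and signatures are illustrative, not the repository's exact API):

```python
import torch.nn.functional as nnF

def encoder_cycle_loss(F_enc, G, x_real, style_trg):
    """Same utterance should map to the same encoding before and after conversion."""
    h_real = F_enc(x_real)               # encode the source speech
    x_fake = G(h_real, style_trg)        # decode/convert to the target speaker
    h_fake = F_enc(x_fake)               # re-encode the converted speech
    return nnF.l1_loss(h_fake, h_real)   # penalize drift in the content representation
```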

@Kristopher-Chen You say it sometimes drops, so in which cases does it drop and in which cases is it good?

@skol101

skol101 commented Jun 14, 2022

@yl4579 cheers! How about shared projection as per #6 (comment) ? Is it applicable to any-to-many conversion?

@yl4579
Owner

yl4579 commented Jun 15, 2022

@skol101 I don't think you need this either; it is meant to make the style encoder speaker-independent so you can convert to any output speaker. If you are only interested in any-to-many, it is not necessary.
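
For reference, a rough sketch of the difference, assuming a style encoder with a shared backbone followed by projection heads (dimensions and class names are made up for illustration):

```python
import torch.nn as nn

class StyleHeads(nn.Module):
    """Projection head(s) placed on top of a shared style-encoder backbone."""
    def __init__(self, feat_dim=512, style_dim=64, num_domains=20, shared=False):
        super().__init__()
        self.shared = shared
        if shared:
            # one projection reused by every speaker -> speaker-independent style space
            self.proj = nn.Linear(feat_dim, style_dim)
        else:
            # one projection per target speaker -> sufficient for any-to-many
            self.proj = nn.ModuleList(
                [nn.Linear(feat_dim, style_dim) for _ in range(num_domains)]
            )

    def forward(self, feat, domain_idx):
        return self.proj(feat) if self.shared else self.proj[domain_idx](feat)
```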

@Kristopher-Chen
Author

Kristopher-Chen commented Jun 15, 2022

> @skol101 I don't believe so; if it's not for any-to-any, you only need a lot of input speakers. You do not need the cycle loss in this case, because you don't really need that many output speakers. One thing you can do is modify the cycle loss to match the encoder output instead of the decoder output (i.e., the same speech should have the same encoded representation before and after conversion).
>
> @Kristopher-Chen You say it sometimes drops, so in which cases does it drop and in which cases is it good?

@yl4579 https://drive.google.com/drive/folders/1SGBJllEvWg9a70qJf5DZhTVT_E5bl-w0
I trained with 200 Chinese speakers; there is an example here. The input was recorded on a PC, and I tried converting it to one male and one female speaker using checkpoints at 50 and 120 epochs. The points are:

  1. the male output sounds noisier than the female output;
  2. by “drops” I mean the speech intelligibility gets worse, as in the 120-epoch outputs. The interesting thing is that the 50-epoch results seem better than the 120-epoch ones; I just could not figure it out.

@thsno02

thsno02 commented Jul 6, 2022

I have tried an any-to-many mapping based solely on this amazing project, and it works well for some speakers but not all. I used 10 speakers with 20 minutes of audio per speaker, and the hyperparameters are the same as the original.

At epoch 248, two speakers work excellently: both can convert Sichuan-dialect Chinese even though they are trained on Mandarin, and both can handle the any-to-many conversion task.

At epoch 466, I get 5 speakers that work perfectly, and the conversion quality for all speakers has improved a lot.

From my experience, you can keep training and wait. The training data is vitally important for this task: higher data quality tends to yield better performance for the speakers. However, quality alone can't guarantee better performance, since I use both Lijian Zhao and Chunying Hua as speakers, and Chunying Hua works well at epoch 248 while Lijian Zhao does not.

@skol101

skol101 commented Jul 20, 2022

@thsno02 what vocoder have you used?

@thsno02

thsno02 commented Jul 21, 2022

@skol101 The original one, and I use the mapping network rather than the style encoder.
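
For context, a rough sketch of the two ways a style vector can be produced at conversion time in a StarGANv2-VC-style model (call signatures are illustrative; the repository's modules may differ):

```python
import torch

@torch.no_grad()
def style_from_mapping(mapping_network, domain_idx, latent_dim=16):
    # mapping network: sample a style from random noise for the chosen target speaker
    z = torch.randn(1, latent_dim)
    y = torch.LongTensor([domain_idx])
    return mapping_network(z, y)

@torch.no_grad()
def style_from_reference(style_encoder, ref_mel, domain_idx):
    # style encoder: extract the style from a reference utterance of the target speaker
    y = torch.LongTensor([domain_idx])
    return style_encoder(ref_mel, y)
```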

@skol101

skol101 commented Jul 21, 2022

Interesting; it was reported elsewhere that the style encoder is better at VC than the mapping network.

Also, you haven't fine-tuned the vocoder on your dataset?

@thsno02

thsno02 commented Jul 22, 2022

I haven’t tried any fine-tuning due to the time frame. I did a lot of experiments on model performance; my conclusion is that the mapping network tends to perform better on the any-to-many task than the style network, while the style network sometimes converts audio with more linguistic information and more fluently. Meanwhile, in my scenario, neither the mapping network nor the style network converts audio at high quality consistently. This phenomenon kills me, and I have not figured it out.

There are many potential reasons for this:

  • I use a Chinese corpus instead of English to train the model; maybe fine-tuning will help;
  • the model is sensitive to volume, since in my model lower-volume input tends to yield bad conversion (see the normalization sketch after this list);
  • the quality of the captured audio varies across different microphones;
  • the quality of my training data is low.
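
On the volume point, a minimal preprocessing sketch that peak-normalizes each clip before feature extraction (librosa and soundfile are assumed to be installed; the 24 kHz sample rate is only an example):

```python
import librosa
import numpy as np
import soundfile as sf

def peak_normalize(in_path, out_path, target_peak=0.95, sr=24000):
    """Rescale a clip so its peak amplitude equals target_peak."""
    wav, _ = librosa.load(in_path, sr=sr)
    peak = np.max(np.abs(wav)) + 1e-8     # avoid division by zero on silent clips
    sf.write(out_path, wav * (target_peak / peak), sr)
```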

Tip: I have trained for 742 epochs, but the model's generalization does not change and I still only get 2 usable speakers.

@skol101

skol101 commented Jul 22, 2022

Have you trained them (mapping and style) both or separately?

@thsno02

thsno02 commented Jul 23, 2022

both

yl4579 added the discussion (New research topic) label on Sep 18, 2022
@1nlplearner

1nlplearner commented Feb 16, 2023

@Kristopher-Chen
How many domains are in your discriminator, and how many discriminators do you use?

@1nlplearner

> @skol101 I don't believe so; if it's not for any-to-any, you only need a lot of input speakers. You do not need the cycle loss in this case, because you don't really need that many output speakers. One thing you can do is modify the cycle loss to match the encoder output instead of the decoder output (i.e., the same speech should have the same encoded representation before and after conversion).
>
> @Kristopher-Chen You say it sometimes drops, so in which cases does it drop and in which cases is it good?

So, do I need to compute the loss on the encoder output before F0 is added?
What is the function of F0?
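
In StarGANv2-VC-style models, F0 is the pitch curve from a pretrained extractor; it conditions the decoder so the converted speech keeps the source intonation pattern, and an F0 consistency loss is usually computed on mean-normalized pitch so the contour shape is preserved while the absolute pitch can follow the target speaker. A hedged sketch of such a loss (the extractor name and exact normalization are illustrative):

```python
import torch.nn.functional as nnF

def f0_consistency_loss(f0_net, x_real, x_fake, eps=1e-8):
    """Compare mean-normalized pitch contours of the source and converted speech."""
    f0_real = f0_net(x_real)   # per-frame F0 of the source
    f0_fake = f0_net(x_fake)   # per-frame F0 of the conversion
    f0_real = f0_real / (f0_real.mean(dim=-1, keepdim=True) + eps)
    f0_fake = f0_fake / (f0_fake.mean(dim=-1, keepdim=True) + eps)
    return nnF.smooth_l1_loss(f0_fake, f0_real)
```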
