Log on f0 #17
Thank you for pointing out the code duplication; it arose from duplicated effort during code organization and the subsequent paper work. Thank you also for sharing your experiences with the various experiments!
Thanks for clearing this up. I ended up using CREPE for f0, but I don't think it made a significant difference, and it's a lot more expensive to create all the embeddings for a large training set.

I've tried a LOT of different things at this point, using 24 kHz audio, but most of them didn't have much impact. I was hoping that using a Diffusers UNet with full attention and a bunch of transformer blocks would have more impact than it did. Same with the Diffusers schedulers: when using latents, they are so small that being able to take fewer steps doesn't matter much. I also tried a Perceiver-style encoder with flash attention and richer feature embeddings, but that didn't seem to have a big impact either.

The only thing that's really made a decent impact is using a VQGAN with 1 channel and bin-wise z-scoring to create the latents. That allows much larger batch sizes and faster training, and the output mel spectrograms are excellent almost immediately. It doesn't cut down on the number of steps you need, though; I assume training becomes almost entirely about learning all the mixup combinations, so the simpler, smaller structures matter less than seeing enough combinations.
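For reference, the bin-wise z-scoring mentioned above can be sketched roughly like this: normalize each mel bin independently using statistics computed over the dataset, so every bin has zero mean and unit variance before quantization. This is a minimal illustration, not the commenter's actual code; the function names and the `(batch, n_mels, time)` layout are assumptions.

```python
import numpy as np

def binwise_zscore(mels, eps=1e-5):
    """Z-score each mel bin independently across batch and time.

    mels: array of shape (batch, n_mels, time) -- layout assumed here.
    Returns the normalized array plus the per-bin mean/std so the
    transform can be inverted after decoding.
    """
    mean = mels.mean(axis=(0, 2), keepdims=True)   # shape (1, n_mels, 1)
    std = mels.std(axis=(0, 2), keepdims=True)
    return (mels - mean) / (std + eps), mean, std

def binwise_unzscore(normed, mean, std, eps=1e-5):
    """Invert binwise_zscore using the stored statistics."""
    return normed * (std + eps) + mean
```

The per-bin statistics would typically be computed once over the training set and reused at inference so the VQGAN always sees inputs on the same scale.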
I see that in the inference file the log of f0 is taken, but then it's not used. I was wondering if that's a bug, and whether you take the log for training too? I've experimented a bit with both (at 24 kHz); I didn't train all the way, but I didn't notice much difference.
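For anyone following along, the usual care with log-f0 is that unvoiced frames are encoded as zero, and `log(0)` is undefined. A minimal sketch of one common convention (this is an assumption about how such a transform might look, not the repo's actual code):

```python
import numpy as np

def log_f0(f0):
    """Log-compress an f0 contour while keeping unvoiced frames at 0.

    f0: array of per-frame f0 values in Hz, with 0 marking unvoiced
    frames. Only voiced frames (f0 > 0) are log-transformed, so the
    unvoiced marker survives the transform.
    """
    out = np.zeros_like(f0, dtype=np.float64)
    voiced = f0 > 0
    out[voiced] = np.log(f0[voiced])
    return out
```

Whether the log version or raw Hz is fed to the model is exactly the question raised above; the observation in the thread is that it made little difference in practice.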
FYI, I also tried using Soft HuBERT fine-tuned on my dataset, taking the hidden state from the 12th layer, and the wav2vec 2.0 model you used crushes it.