Log on f0 #17
Thank you for pointing out the code duplication; it arose from duplicated effort during code organization and the subsequent paper work. Thank you also for sharing your experiences with the various experiments!
Thanks for clearing this up. I ended up using CREPE for f0, but I don't think it made a significant difference, and it's a lot more expensive to create all the embeddings for a large training set.

I've tried a LOT of different things at this point, using 24 kHz audio, but most of them didn't have much impact. I was hoping that using a Diffusers UNet with full attention and a bunch of transformer blocks would have more impact than it did. Same with the Diffusers schedulers: when using latents, they are so small that being able to take fewer steps doesn't matter much. I also tried a Perceiver-style encoder with flash attention and richer feature embeddings, but that didn't seem to have a big impact either.

The only thing that's really made a decent impact is using a VQGAN with 1 channel and bin-wise z-scoring to create the latents. That allows much larger batch sizes and faster training, and the output mel spectrograms are excellent almost immediately. It doesn't cut down on the number of steps you need, though; I assume training becomes almost entirely about learning all the mixup combinations, so the simpler, smaller structures matter less than seeing enough combinations.
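For reference, the bin-wise z-scoring mentioned above can be sketched roughly like this: normalize each mel bin independently using statistics computed over the dataset, so every bin has zero mean and unit variance before quantization. This is a minimal illustration, not the commenter's actual code; the function names and the `(batch, n_mels, time)` layout are assumptions.

```python
import numpy as np

def binwise_zscore(mels, eps=1e-5):
    """Z-score each mel bin independently across batch and time.

    mels: array of shape (batch, n_mels, time) -- layout assumed here.
    Returns the normalized array plus the per-bin mean/std so the
    transform can be inverted after decoding.
    """
    mean = mels.mean(axis=(0, 2), keepdims=True)   # shape (1, n_mels, 1)
    std = mels.std(axis=(0, 2), keepdims=True)
    return (mels - mean) / (std + eps), mean, std

def binwise_unzscore(normed, mean, std, eps=1e-5):
    """Invert binwise_zscore using the stored statistics."""
    return normed * (std + eps) + mean
```

The per-bin statistics would typically be computed once over the training set and reused at inference so the VQGAN always sees inputs on the same scale.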
I see that in the inference file the log of f0 is taken, but then it's not used. I was wondering if that's a bug, and whether you take the log for training too? I've experimented a bit with both (at 24 kHz); I didn't train all the way, but I didn't notice much difference.
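For anyone following along, the usual care with log-f0 is that unvoiced frames are encoded as zero, and `log(0)` is undefined. A minimal sketch of one common convention (this is an assumption about how such a transform might look, not the repo's actual code):

```python
import numpy as np

def log_f0(f0):
    """Log-compress an f0 contour while keeping unvoiced frames at 0.

    f0: array of per-frame f0 values in Hz, with 0 marking unvoiced
    frames. Only voiced frames (f0 > 0) are log-transformed, so the
    unvoiced marker survives the transform.
    """
    out = np.zeros_like(f0, dtype=np.float64)
    voiced = f0 > 0
    out[voiced] = np.log(f0[voiced])
    return out
```

Whether the log version or raw Hz is fed to the model is exactly the question raised above; the observation in the thread is that it made little difference in practice.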
FYI, I also tried using Soft HuBERT fine-tuned on my dataset, taking the hidden state from the 12th layer, and the wav2vec 2.0 model you used crushes it.