[audio utils] fix fft_bin_width computation #1274


Merged
xenova merged 1 commit into main from fix-triangularise-mel-space
Apr 9, 2025

Conversation

xenova
Collaborator

@xenova xenova commented Apr 6, 2025

Copied from huggingface/transformers#36603:

When computing triangular mel filter matrices with triangularization in mel space, if num_frequency_bins is e.g. 257, meaning a real-valued FFT has been computed on 512 points (257 = n_fft // 2 + 1), then fft_bin_width should be (sampling rate / 2) / number_of_bins, with number_of_bins = num_frequency_bins - 1.
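As a quick sanity check of the formula above, here is a minimal Python sketch. The helper name `fft_bin_width` and the 16 kHz sampling rate are illustrative assumptions, not code from this PR. It shows that with the corrected width the last of the 257 bins lands exactly on the Nyquist frequency.

```python
def fft_bin_width(sampling_rate: float, num_frequency_bins: int) -> float:
    """Width in Hz of one FFT bin, for num_frequency_bins = n_fft // 2 + 1.

    Hypothetical helper for illustration only; it mirrors the formula
    stated in the PR description, not the library's actual code.
    """
    # The bins span 0 Hz .. Nyquist inclusive, so there are
    # (num_frequency_bins - 1) intervals between them.
    number_of_bins = num_frequency_bins - 1
    return (sampling_rate / 2) / number_of_bins


sampling_rate = 16000      # assumed example value
n_fft = 512
num_frequency_bins = n_fft // 2 + 1   # 257

width = fft_bin_width(sampling_rate, num_frequency_bins)
# Centre frequency of every FFT bin, in Hz.
bin_frequencies = [width * k for k in range(num_frequency_bins)]

print(width)                 # 31.25
print(bin_frequencies[-1])   # 8000.0, i.e. the Nyquist frequency
```

Note that the width equals sampling_rate / n_fft, which is the usual spacing between bins of a real-valued FFT.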

This bug was very likely introduced, and not caught by tests, because mel_filter_bank was called with num_frequency_bins = 256 (following the example above) when using triangularize_in_mel_space=True, and the result was then padded with 0s to recover the expected shape (257). This is bad practice and misleading for the user, since it does not respect the method's API!
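The sketch below illustrates why that workaround could hide the problem. It assumes, based only on the description above, that the pre-fix code divided by num_frequency_bins rather than num_frequency_bins - 1. Both function names are hypothetical and the 16 kHz sampling rate is an example value.

```python
def bin_freqs_fixed(sampling_rate, num_frequency_bins):
    # Corrected spacing: (sampling rate / 2) / (num_frequency_bins - 1).
    width = (sampling_rate / 2) / (num_frequency_bins - 1)
    return [width * k for k in range(num_frequency_bins)]


def bin_freqs_old(sampling_rate, num_frequency_bins):
    # ASSUMPTION: the pre-fix behaviour divided by num_frequency_bins.
    # This is inferred from the PR text, not copied from the library.
    width = (sampling_rate / 2) / num_frequency_bins
    return [width * k for k in range(num_frequency_bins)]


sampling_rate = 16000  # assumed example value

# Workaround described above: pass 256 instead of 257, pad afterwards.
workaround = bin_freqs_old(sampling_rate, 256)
# Respecting the API (257 bins) with the assumed old formula.
direct_old = bin_freqs_old(sampling_rate, 257)
# Respecting the API with the corrected formula.
fixed = bin_freqs_fixed(sampling_rate, 257)

# The workaround happens to reproduce the correct spacing for the
# first 256 bins, which is why tests built on it would not fail.
print(workaround == fixed[:256])   # True
# Called as documented, the assumed old formula falls short of Nyquist.
print(direct_old[-1] < fixed[-1])  # True
```

Under this assumption, only callers who passed the documented value (257) would have received slightly compressed bin frequencies.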

I also updated the expected outputs with the ones I got from running the torchaudio kaldi implementation, which confirms that our implementation was incorrect.

Thanks @eustlb for the original fix!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@xenova xenova merged commit 10c09fb into main Apr 9, 2025
4 checks passed
@xenova xenova deleted the fix-triangularise-mel-space branch April 9, 2025 19:10
2 participants