This repository has been archived by the owner on Mar 8, 2023. It is now read-only.

Voxforge bad samples, help for cleaning up #111

Open
12 tasks done
nefastosaturo opened this issue Dec 11, 2020 · 3 comments
Labels
help wanted Extra attention is needed

Comments

@nefastosaturo
Collaborator

nefastosaturo commented Dec 11, 2020

EDIT:

With some audio analysis we found a few bad speakers, but all the other speakers still need a manual check.

If you want to help, please:

  1. Choose a speaker from here: http://www.voxforge.org/it/Downloads (ideally a speaker who has recorded many minutes of audio)
  2. Download that speaker's archive from here: http://www.repository.voxforge1.org/downloads/it/Trunk/Audio/Main/16kHz_16bit/
  3. Listen to the recordings and tell us whether they are valid, which segments are not valid, or whether everything from that speaker must be discarded

A valid recording must contain speech and must be understandable, even if the volume is very low.
For example, Vistaus-20080718-mrm is not valid.

DONE!

I've found some bad samples in this dataset. I searched for audio files with an average RMS below 0.025 and found these speakers that need to be checked:

  • anonymous-20080504-qvg - NO
  • anonymous-20080723-ouv - NO
  • anonymous-20080725-dey - NO
  • anonymous-20110605-kpd
  • anonymous-20170303-mwy
  • dario-20110426-yhj
  • Karm-20131225-irq
  • nannioz-20091103-qfc - ok
  • nannioz-20091103-raj - ok
  • nannioz-20091103-vkr - ok
  • nannioz-20091103-zhz - ok
  • Stefano-20150131-pus - ok

There is also one speaker who is not Italian, which I'll remove:

Vistaus-20080718-mrm

So I'm asking whether you can choose two speakers, listen to their recordings, and report if anything is VERY wrong (e.g. we can keep very low-volume but understandable recordings).

You'll find all the recordings here: http://www.repository.voxforge1.org/downloads/it/Trunk/Audio/Main/16kHz_16bit/

A CSV containing all the samples with their RMS is attached:
voxforge_bad_samples.zip
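
For anyone who wants to reproduce the low-RMS search on their own copy of the data, here is a minimal sketch. It is not the script that produced the attached CSV. It assumes 16-bit PCM WAV files, samples normalized to [-1, 1], one whole-file RMS value per recording (the "average RMS" above may have been computed differently, e.g. per frame), and a hypothetical layout where each extracted speaker archive is a directory directly under a common root. The function names `wav_rms` and `quiet_speakers` are made up for illustration; only the 0.025 threshold comes from this issue.

```python
import array
import math
import sys
import wave
from pathlib import Path

RMS_THRESHOLD = 0.025  # threshold quoted in this issue


def wav_rms(path):
    """Return the whole-file RMS of a 16-bit PCM WAV, normalized to [0, 1]."""
    with wave.open(str(path), "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError(f"{path}: expected 16-bit samples")
        frames = wav.readframes(wav.getnframes())
    samples = array.array("h")  # signed 16-bit integers
    samples.frombytes(frames)
    if sys.byteorder == "big":  # WAV data is little-endian
        samples.byteswap()
    if not samples:
        return 0.0  # treat an empty file as silent
    mean_square = sum(s * s for s in samples) / len(samples)
    return math.sqrt(mean_square) / 32768.0


def quiet_speakers(root, threshold=RMS_THRESHOLD):
    """Map each speaker directory under root to its files with RMS below threshold.

    Assumes the (hypothetical) layout <root>/<speaker>/.../<file>.wav.
    """
    root = Path(root)
    flagged = {}
    for wav_path in sorted(root.rglob("*.wav")):
        rms = wav_rms(wav_path)
        if rms < threshold:
            speaker = wav_path.relative_to(root).parts[0]
            flagged.setdefault(speaker, []).append((wav_path.name, rms))
    return flagged
```

Calling `quiet_speakers("path/to/extracted/archives")` returns a dictionary keyed by speaker directory name. A low RMS only marks a recording as a candidate: as noted above, quiet but understandable audio is still valid, so flagged speakers still need to be listened to.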

@nefastosaturo nefastosaturo added the help wanted Extra attention is needed label Dec 11, 2020
@dag7dev
Member

dag7dev commented Dec 13, 2020

Checked nannioz and Stefano: I'm updating the issue and listing the results below.

nannioz-20091103-qfc - ok
nannioz-20091103-raj - ok
nannioz-20091103-vkr - ok
nannioz-20091103-zhz - ok
Stefano-20150131-pus - ok

@nefastosaturo
Collaborator Author

OK, so unless we find other strange samples, we are done here. I'm leaving the issue open for future checks.

@eziolotta
Contributor

anonymous-20080725-dey - NO - EMPTY AUDIO
anonymous-20110605-kpd - OK - low-volume but understandable audio
anonymous-20170303-mwy - OK - low-volume but understandable audio
dario-20110426-yhj - OK
Karm-20131225-irq - OK
