This repository contains information related to the LibriVox Spanish speech corpus created by CIEMPIESS, including a mapping between the manually annotated resources and their original sources in LibriVox, as well as some manual annotations using punctuation marks.
First, we extracted all information from files/Speaker_Info.xls in the LibriVox Spanish corpus into corpus_info/corpus_info.csv.
Then we created a copy of corpus_info.csv as corpus_info/corpus_info_formatted.csv, which includes an additional total_seconds column, using scripts/transform_time_column.py.
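The conversion behind total_seconds could look roughly like the sketch below. It assumes the duration is stored as an HH:MM:SS or MM:SS string; the real column format in Speaker_Info.xls and the logic in scripts/transform_time_column.py may differ, and the function name to_total_seconds is hypothetical.

```python
def to_total_seconds(duration: str) -> int:
    """Convert an 'HH:MM:SS' or 'MM:SS' string (assumed format) to seconds."""
    parts = [int(p) for p in duration.strip().split(":")]
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"unexpected duration: {duration!r}")
    seconds = 0
    for part in parts:
        # Each step shifts the running total one unit up (x60) and adds the next field.
        seconds = seconds * 60 + part
    return seconds


print(to_total_seconds("01:02:03"))  # 3723
```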
Next, we sorted that file by the new total_seconds column and stored the result in corpus_info/corpus_info_formatted_sorted.csv using scripts/sort_by_time_column.py.
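As an illustration of the sorting step, here is a minimal sketch using only the standard csv module. It works on in-memory text so it can be run as-is, and it assumes ascending order and an integer total_seconds column; the speaker column and the function name are hypothetical, and scripts/sort_by_time_column.py may work differently.

```python
import csv
import io


def sort_csv_by_total_seconds(csv_text: str) -> str:
    """Return the CSV text with rows sorted by total_seconds (ascending, assumed)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = sorted(reader, key=lambda row: int(row["total_seconds"]))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


sample = "speaker,total_seconds\nA,750\nB,95\n"
print(sort_csv_by_total_seconds(sample))
```

Converting the key with int matters here: sorting the raw strings would place "750" before "95".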
We then manually added new columns with useful information: chapter_name, librivox_book_url, audio_url, text_url and speaker_url.
We manually downloaded the texts using the references in the text_url column of corpus_info/corpus_info_formatted_sorted.csv.
We also performed a first manual cleaning, putting the title on the first line and delimiting it with a period.
All of those files are stored in original_text.
We tried automatic tokenization with nltk using scripts/download_nltk_data.py and scripts/tokenize_texts_nltk.py. However, that tokenization did not split on special characters such as ? ; ! and others, which left the segmentation too coarse.
For that reason we implemented a custom splitter, defined in scripts/tokenize_texts.py, which tokenizes the texts with the following expression:
[x for x in re.split("\.|,|;|:|\n|!|¿|¡|\?|-|—\(|\)", text) if x.replace(" ", "")]
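To show what the expression above produces, here is a small runnable sketch. The pattern is copied from the README; it is written as a raw string only to avoid Python's invalid-escape warnings, which does not change what it matches. The function name split_text and the sample sentence are illustrative and not taken from scripts/tokenize_texts.py.

```python
import re

# Pattern copied from the README expression, as a raw string.
PATTERN = r"\.|,|;|:|\n|!|¿|¡|\?|-|—\(|\)"


def split_text(text):
    """Split on the delimiters and drop pieces that are empty or only spaces."""
    return [x for x in re.split(PATTERN, text) if x.replace(" ", "")]


print(split_text("¿Dónde está? Aquí, en la casa; nada más."))
# ['Dónde está', ' Aquí', ' en la casa', ' nada más']
```

Two details are worth knowing when reading the output. Segments keep their leading spaces, because the filter only discards blank pieces and does not strip the rest. Also, as written, the alternative `—\(` matches only a long dash immediately followed by an opening parenthesis, so a lone long dash or a lone "(" does not split the text; whether that is intended is not stated here.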
All tokenized texts are stored in the tokenized_text folder, and each file uses the format number: text.
This number is used as a sentence identifier in the annotation process that follows.
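A minimal sketch of writing and reading the number: text format follows. It assumes one segment per line, identifiers starting at 1, and surrounding whitespace stripped from each segment; none of these details are confirmed by the repository, and both function names are hypothetical.

```python
def format_numbered(segments, start=1):
    """Render segments as 'number: text' lines (numbering from 1 is an assumption)."""
    return "\n".join(
        f"{number}: {segment.strip()}"
        for number, segment in enumerate(segments, start)
    )


def parse_numbered(content):
    """Read 'number: text' lines back into a {number: text} dictionary."""
    sentences = {}
    for line in content.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        number, _, text = line.partition(": ")
        sentences[int(number)] = text
    return sentences


content = format_numbered(["Dónde está", " Aquí"])
print(content)
print(parse_numbered(content))
```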
We use scripts/download_localized_audios.py to download the files listed in the audio_url column of corpus_info/corpus_info_formatted_sorted.csv.
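The download step can be sketched with the standard library alone, as below. This is not the content of scripts/download_localized_audios.py: the function names and the target directory argument are hypothetical, and naming each file after the last path component of its URL is an assumption.

```python
import csv
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def local_name(audio_url):
    """Derive a local file name from the URL path (assumed naming scheme)."""
    return Path(urlparse(audio_url).path).name


def download_audios(csv_path, target_dir):
    """Download every audio_url in the CSV into target_dir, skipping existing files."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            url = row["audio_url"]
            destination = target / local_name(url)
            if not destination.exists():
                urllib.request.urlretrieve(url, str(destination))


# The URL below is a placeholder, not a real LibriVox address.
print(local_name("https://example.org/audio/chapter_01.mp3?x=1"))  # chapter_01.mp3
```

Skipping files that already exist lets an interrupted run be restarted without downloading everything again.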
We then convert all audio files to WAV using sox and scripts/transform_mp3_to_wav.bash.
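For readers who prefer to drive sox from Python, a rough equivalent of the conversion could look like this. It is a sketch, not a port of scripts/transform_mp3_to_wav.bash: the directory names are hypothetical, and the optional sample rate is left unset because the target rate used by the project is not stated here.

```python
import subprocess
from pathlib import Path


def build_sox_command(mp3_path, wav_dir, sample_rate=None):
    """Build the sox argument list that converts one MP3 file to WAV."""
    mp3_path = Path(mp3_path)
    wav_path = Path(wav_dir) / (mp3_path.stem + ".wav")
    command = ["sox", str(mp3_path)]
    if sample_rate is not None:
        # Options placed before the output file apply to the output.
        command += ["-r", str(sample_rate)]
    command.append(str(wav_path))
    return command


def convert(mp3_path, wav_dir, sample_rate=None):
    """Run sox; raises CalledProcessError if the conversion fails."""
    subprocess.run(build_sox_command(mp3_path, wav_dir, sample_rate), check=True)


print(build_sox_command("audios/a.mp3", "wavs"))
```

Reading MP3 input requires sox to be built with MP3 support, which on Ubuntu is provided by the libsox-fmt-mp3 package.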
All segmented files are stored in the annotations folder, where each file corresponds to a recording. Files are stored in TextGrid format.
Before executing any code, make sure you have Python 3.6+ installed and a virtual environment set up:
python -m venv librivox_spanish_recreation_env
source librivox_spanish_recreation_env/bin/activate
pip install -r requirements.txt
To install sox, check your software package manager. If using Ubuntu:
sudo apt install sox
sudo apt install libsox-fmt-mp3
We segment the audio and find silence where there is a punctuation mark.
True positive: silence corresponds to a punctuation mark.
False negative: there is no silence, but there is a punctuation mark.
False positive: there is silence in the recording, but no punctuation mark.
True negative: there is neither silence nor a punctuation mark.
Negative: signal. Positive: silence.
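The four outcomes above can be tallied with a short sketch like the following, where silence is the positive class and signal the negative one. The input representation, one (has_silence, has_punctuation) pair per candidate boundary, and the function name are assumptions made for illustration.

```python
def confusion_counts(pairs):
    """Count TP/FP/FN/TN from (has_silence, has_punctuation) pairs."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for has_silence, has_punctuation in pairs:
        if has_silence and has_punctuation:
            counts["TP"] += 1  # silence matches a punctuation mark
        elif has_silence:
            counts["FP"] += 1  # silence without a punctuation mark
        elif has_punctuation:
            counts["FN"] += 1  # punctuation mark without silence
        else:
            counts["TN"] += 1  # neither silence nor punctuation
    return counts


example = [(True, True), (True, False), (False, True), (False, False), (True, True)]
print(confusion_counts(example))
# {'TP': 2, 'FP': 1, 'FN': 1, 'TN': 1}
```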
Regularize to a 0.01 threshold.
5 microcorpora of 20 recordings.
Partition at the segment level per corpus, taking those with more than 10.