
open-speech-org/librivox-spanish-recreation


LibriVox Spanish Recreation

This repository contains information related to the LibriVox Spanish speech corpus created by CIEMPIESS, including a mapping between the manually annotated resources and their original sources in LibriVox, as well as some manual annotations using point symbols.

Process

Corpus information

First, we extracted all information from files/Speaker_Info.xls in the LibriVox Spanish corpus into corpus_info/corpus_info.csv.

Then we created a copy of that file as corpus_info/corpus_info_formatted.csv, which includes an additional total_seconds column, using scripts/transform_time_column.py.
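As an illustration of the kind of conversion behind total_seconds, here is a minimal sketch. It is not the code in scripts/transform_time_column.py: the time format (HH:MM:SS or MM:SS) and the source column name `duration` are assumptions made for the example.

```python
def to_total_seconds(value):
    """Convert 'HH:MM:SS', 'MM:SS' or 'SS' into a number of seconds."""
    parts = [int(part) for part in value.strip().split(":")]
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"unexpected time value: {value!r}")
    seconds = 0
    for part in parts:
        # Each step shifts the running total one unit up (hours -> minutes -> seconds).
        seconds = seconds * 60 + part
    return seconds


def add_total_seconds(rows, time_column="duration"):
    """Yield copies of each row with an extra total_seconds field.

    `time_column` is a hypothetical column name; use the real one from the CSV.
    """
    for row in rows:
        updated = dict(row)
        updated["total_seconds"] = to_total_seconds(row[time_column])
        yield updated
```

The rows can come from `csv.DictReader`, so the result can be written back with `csv.DictWriter` once total_seconds is appended to the field names.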

Next, we sorted that file by the new total_seconds column and stored the result in corpus_info/corpus_info_formatted_sorted.csv using scripts/sort_by_time_column.py.
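The sorting step can be sketched with the standard csv module. This is an assumption-laden example rather than the contents of scripts/sort_by_time_column.py: the function name is hypothetical, and ascending order is assumed because the README does not state the direction.

```python
import csv


def sort_csv_by_total_seconds(src_path, dst_path, column="total_seconds"):
    """Copy a CSV file to dst_path with its rows sorted by a numeric column."""
    with open(src_path, newline="", encoding="utf-8") as src:
        reader = csv.DictReader(src)
        fieldnames = reader.fieldnames
        # Compare as numbers, not strings, so that 5 sorts before 12.
        rows = sorted(reader, key=lambda row: float(row[column]))

    with open(dst_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

Pass `reverse=True` to `sorted` if the longest recordings should come first.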

Finally, we manually added new columns with useful information: chapter_name, librivox_book_url, audio_url, text_url and speaker_url.

Data preparation

Texts

We manually downloaded the texts using the references in the text_url column of corpus_info/corpus_info_formatted_sorted.csv.

We also performed a first manual cleaning, putting the title on the first line and delimiting it with a period.

All those files are stored in original_text

We tried automatic tokenization with nltk in scripts/download_nltk_data.py and scripts/tokenize_texts_nltk.py. However, that tokenization did not split on special characters such as ? ; ! and others, which left the segmentation too coarse. For that reason we implemented a custom splitter in scripts/tokenize_texts.py, which tokenizes the texts with the following expression: [x for x in re.split("\.|,|;|:|\n|!|¿|¡|\?|-|—\(|\)", text) if x.replace(" ", "")]
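A self-contained sketch of this kind of punctuation splitter follows. It is not a copy of scripts/tokenize_texts.py: the names `DELIMITERS` and `split_text` are hypothetical, and the pattern is written as a character class. One deliberate difference is worth noting: in the expression quoted above, `—\(` has no `|` between the two characters, so it only matches a long dash immediately followed by an opening parenthesis. The sketch assumes both were meant as separate delimiters.

```python
import re

# One character class instead of a long alternation. Assumption: the long
# dash and the opening parenthesis are independent delimiters.
DELIMITERS = re.compile(r"[.,;:\n!¿¡?\-—()]")


def split_text(text):
    """Split text on punctuation and drop pieces that are empty or only spaces."""
    return [piece for piece in DELIMITERS.split(text) if piece.replace(" ", "")]
```

As in the quoted expression, pieces keep their surrounding spaces: `split_text("¿Quién es? Soy yo, dijo.")` yields `["Quién es", " Soy yo", " dijo"]`.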

All tokenized texts are stored in the tokenized_text folder, and each file uses the format number: text. This number is used as a sentence identifier in the annotation process that follows.
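To make the number: text format concrete, the sketch below writes and reads such lines. The helper names are hypothetical, and two details are assumptions because the README does not specify them: numbering starts at 1, and surrounding whitespace is stripped from each segment.

```python
def format_numbered(segments, start=1):
    """Turn a list of segments into 'number: text' lines."""
    return [
        f"{number}: {segment.strip()}"
        for number, segment in enumerate(segments, start=start)
    ]


def parse_numbered(line):
    """Split a 'number: text' line back into (sentence_id, text)."""
    # Only the first ': ' separates the identifier, so colons in the text survive.
    number, _, text = line.partition(": ")
    return int(number), text
```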

Audios

We use scripts/download_localized_audios.py to download the files listed in the audio_url column of corpus_info/corpus_info_formatted_sorted.csv.

We then convert all audio files from MP3 to WAV using sox and scripts/transform_mp3_to_wav.bash.
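A download loop over the audio_url column could look like the sketch below. It uses only the standard library and is not the code in scripts/download_localized_audios.py. The function names are hypothetical, and naming each local file after the last path component of its URL is an assumption.

```python
import csv
import os
import posixpath
import urllib.request
from urllib.parse import urlparse


def local_name(url):
    """Use the last path component of the URL as the local file name."""
    return posixpath.basename(urlparse(url).path)


def download_audios(csv_path, out_dir, column="audio_url"):
    """Download every URL in `column`, skipping files that already exist."""
    os.makedirs(out_dir, exist_ok=True)
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            url = row[column]
            if not url:
                continue  # rows without an audio URL are ignored
            target = os.path.join(out_dir, local_name(url))
            if not os.path.exists(target):
                urllib.request.urlretrieve(url, target)
```

Skipping existing files makes the function safe to rerun after an interrupted download.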
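The repository performs this conversion with a bash script whose sox options are not shown here. As a rough Python equivalent, the sketch below calls sox with only an input and an output path and lets it infer the formats from the file extensions. The function names and directory layout are hypothetical.

```python
import subprocess
from pathlib import Path


def sox_command(mp3_path, wav_dir):
    """Build the sox call that converts one MP3 file into a WAV file."""
    wav_path = Path(wav_dir) / (Path(mp3_path).stem + ".wav")
    return ["sox", str(mp3_path), str(wav_path)]


def convert_all(mp3_dir, wav_dir):
    """Convert every .mp3 file in mp3_dir; requires sox with MP3 support."""
    Path(wav_dir).mkdir(parents=True, exist_ok=True)
    for mp3_path in sorted(Path(mp3_dir).glob("*.mp3")):
        # check=True raises CalledProcessError if sox fails on a file.
        subprocess.run(sox_command(mp3_path, wav_dir), check=True)
```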

Segmentation

All segmented files are stored in the annotations folder, where each file corresponds to a recording. Files are stored in TextGrid format.

Troubleshooting

Before executing any code, make sure you have Python 3.6+ installed and a virtual environment set up:

python -m venv librivox_spanish_recreation_env
source librivox_spanish_recreation_env/bin/activate
pip install -r requirements.txt

To install sox, check your software package manager. On Ubuntu:

sudo apt install sox
sudo apt install libsox-fmt-mp3

Notes

We segment the audio and find silence where there is a punctuation mark.

True positive: silence coincides with a punctuation mark.

False negative: there is no silence, but there is a punctuation mark.

False positive: there is silence in the recording, but no punctuation mark.

True negative: there is neither silence nor a punctuation mark.

Negative: signal. Positive: silence.
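The four outcomes in these notes can be tallied as a confusion matrix. The sketch below is only an illustration, with hypothetical names. It assumes each candidate boundary has already been reduced to a pair of booleans (silence detected, punctuation mark present), and it treats silence as the positive class.

```python
from collections import Counter


def classify(has_silence, has_punctuation):
    """Label one boundary, with silence as the positive class."""
    if has_silence and has_punctuation:
        return "TP"  # silence coincides with a punctuation mark
    if has_silence:
        return "FP"  # silence without a punctuation mark
    if has_punctuation:
        return "FN"  # punctuation mark without silence
    return "TN"  # neither silence nor punctuation mark


def tally(boundaries):
    """Count outcomes over (has_silence, has_punctuation) pairs."""
    return Counter(classify(silence, punct) for silence, punct in boundaries)
```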

Regularize to a threshold of 0.01

5 microcorpora of 20 recordings

Partition at the segment level per corpus, taking those with more than 10
