LibriVox Spanish Recreation

This repository contains information related to the LibriVox Spanish speech corpus created by CIEMPIESS, including a mapping between the manually annotated resources and their original sources in LibriVox, as well as some manual annotations using point symbols.

Process

Corpus information

First, we extracted all the information from files/Speaker_Info.xls in the LibriVox Spanish corpus into corpus_info/corpus_info.csv.

Then we used scripts/transform_time_column.py to create a copy of corpus_info/corpus_info.csv, named corpus_info/corpus_info_formatted.csv, which includes an additional total_seconds column.
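The conversion itself lives in scripts/transform_time_column.py and is not reproduced here. As a rough sketch of the idea, assuming the duration column holds values shaped like HH:MM:SS or MM:SS (the real format in Speaker_Info.xls may differ), a helper with the hypothetical name to_total_seconds could look like this:

```python
def to_total_seconds(duration: str) -> int:
    """Convert a colon-separated duration ('HH:MM:SS' or 'MM:SS') to seconds.

    Assumption: every field is a whole number. The actual script may
    handle other formats.
    """
    total = 0
    for part in duration.strip().split(":"):
        # Each colon shifts the running total up by a factor of 60.
        total = total * 60 + int(part)
    return total


print(to_total_seconds("01:02:03"))  # 3723
print(to_total_seconds("12:30"))     # 750
```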

Next, we sorted that file by the newly created total_seconds column and stored the result in corpus_info/corpus_info_formatted_sorted.csv using scripts/sort_by_time_column.py.
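The sorting step can be pictured with the standard csv module. This is only a sketch: the function name sort_rows_by_total_seconds is hypothetical, and ascending order is an assumption, since the sort direction used by scripts/sort_by_time_column.py is not stated here.

```python
import csv
import io


def sort_rows_by_total_seconds(csv_text: str) -> str:
    """Return CSV content with rows ordered by ascending total_seconds.

    Assumption: ascending order. The real script may sort differently.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = sorted(reader, key=lambda row: float(row["total_seconds"]))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


sample = "speaker,total_seconds\nA,90\nB,30\n"
print(sort_rows_by_total_seconds(sample))
```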

Finally, we manually added new columns with useful information, such as chapter_name, librivox_book_url, audio_url, text_url and speaker_url.

Data preparation

Texts

We manually downloaded each text using the reference in the text_url column of corpus_info/corpus_info_formatted_sorted.csv.

We also performed a first manual cleaning pass, placing the title on the first line and ending it with a period.

All of those files are stored in original_text.

We tried automatic tokenization with nltk, using scripts/download_nltk_data.py and scripts/tokenize_texts_nltk.py. However, that tokenization did not split on special characters such as ? ; ! and others, which left the segmentation too broad. For that reason, we implemented a custom splitter in scripts/tokenize_texts.py, which tokenizes the texts with the following expression:

[x for x in re.split("\.|,|;|:|\n|!|¿|¡|\?|-|—\(|\)", text) if x.replace(" ", "")]
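To see what that expression does in practice, here is a small self-contained sketch. The pattern is copied as written above, only turned into a raw string to avoid escape warnings. The wrapper name split_text is hypothetical. Note that, as written, the alternative —\( matches only an em dash immediately followed by an opening parenthesis, so a standalone em dash or opening parenthesis does not split the text; whether that is intended is not stated.

```python
import re
from typing import List

# Pattern copied from the expression above, written as a raw string.
SPLIT_PATTERN = r"\.|,|;|:|\n|!|¿|¡|\?|-|—\(|\)"


def split_text(text: str) -> List[str]:
    """Split on the punctuation in SPLIT_PATTERN and drop blank pieces."""
    return [x for x in re.split(SPLIT_PATTERN, text) if x.replace(" ", "")]


print(split_text("¿Cómo estás? Bien, gracias."))
# ['Cómo estás', ' Bien', ' gracias']
```

Leading spaces are kept in each piece, because the filter only discards pieces that consist entirely of spaces.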

All tokenized texts are stored in the tokenized_text folder, and each file uses the format number: text. This number serves as the sentence identifier in the annotation process that follows.
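As an illustration of the number: text layout, the sketch below writes and reads such lines. It assumes one sentence per line and numbering that starts at 1. Neither detail is confirmed here, and both function names are hypothetical.

```python
from typing import List, Tuple


def format_numbered_lines(sentences: List[str]) -> str:
    """Render sentences as 'number: text' lines (assumed to start at 1)."""
    return "\n".join(
        f"{index}: {sentence.strip()}"
        for index, sentence in enumerate(sentences, start=1)
    )


def parse_numbered_line(line: str) -> Tuple[int, str]:
    """Recover the sentence identifier and the text from one line."""
    number, text = line.split(": ", 1)
    return int(number), text


content = format_numbered_lines(["Cómo estás", " Bien", " gracias"])
print(content)
print(parse_numbered_line("2: Bien"))  # (2, 'Bien')
```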

Audios

We used scripts/download_localized_audios.py to download the files listed in the audio_url column of corpus_info/corpus_info_formatted_sorted.csv.
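The download step can be sketched with the standard library alone. This is not the content of scripts/download_localized_audios.py. The function names, the output directory argument and the choice to name each file after the last URL segment are all assumptions made for illustration.

```python
import csv
import os
import urllib.request
from typing import List, Tuple


def plan_downloads(csv_path: str, output_dir: str) -> List[Tuple[str, str]]:
    """Pair every non-empty audio_url with a local destination path."""
    plan = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            url = (row.get("audio_url") or "").strip()
            if url:
                # Assumption: the last URL segment is a usable file name.
                filename = url.rstrip("/").split("/")[-1]
                plan.append((url, os.path.join(output_dir, filename)))
    return plan


def download_all(plan: List[Tuple[str, str]]) -> None:
    """Fetch each URL unless the destination file already exists."""
    for url, destination in plan:
        if os.path.exists(destination):
            continue
        os.makedirs(os.path.dirname(destination) or ".", exist_ok=True)
        urllib.request.urlretrieve(url, destination)
```

Separating the planning from the fetching makes it possible to inspect the list of files before any network request is made, for example with download_all(plan_downloads("corpus_info/corpus_info_formatted_sorted.csv", "audios")), where audios is a hypothetical folder name.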

We then converted all the audio files from MP3 to WAV using sox and scripts/transform_mp3_to_wav.bash.
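For readers who prefer to drive the conversion from Python, the sketch below builds the basic sox invocation (sox input.mp3 output.wav) for one file. It is not a port of scripts/transform_mp3_to_wav.bash: any sample-rate or channel options that script may pass are unknown and therefore omitted, and the function names are hypothetical.

```python
import subprocess
from pathlib import Path
from typing import List


def build_sox_command(mp3_path: str, wav_dir: str) -> List[str]:
    """Build the sox command that converts one MP3 into a WAV in wav_dir."""
    wav_path = Path(wav_dir) / (Path(mp3_path).stem + ".wav")
    return ["sox", mp3_path, str(wav_path)]


def convert(mp3_path: str, wav_dir: str) -> None:
    """Run sox for one file. Requires sox with MP3 support to be installed."""
    Path(wav_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_sox_command(mp3_path, wav_dir), check=True)


print(build_sox_command("audios/chapter_01.mp3", "wav"))
```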

Segmentation

All segmented files are stored in the annotations folder, where each file corresponds to one recording. Files are stored in TextGrid format.
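TextGrid files written by Praat in the long text format can be inspected without extra dependencies. The sketch below pulls the intervals out of such a file with a regular expression. It handles interval tiers only, not point tiers. The tier name and labels in the sample are invented for illustration, since the tier layout of the files in annotations is not described here.

```python
import re
from typing import List, Tuple

# An interval is an xmin/xmax pair immediately followed by a text field.
# Praat escapes a double quote inside a label by doubling it.
INTERVAL = re.compile(
    r'xmin = ([0-9.]+)\s+xmax = ([0-9.]+)\s+text = "((?:[^"]|"")*)"'
)


def read_intervals(textgrid: str) -> List[Tuple[float, float, str]]:
    """Return (start, end, label) for every interval in the TextGrid text."""
    return [
        (float(start), float(end), label.replace('""', '"'))
        for start, end, label in INTERVAL.findall(textgrid)
    ]


# Hypothetical sample: one interval tier with a labelled and an empty interval.
SAMPLE = '''File type = "ooTextFile"
Object class = "TextGrid"

xmin = 0
xmax = 2.5
tiers? <exists>
size = 1
item []:
    item [1]:
        class = "IntervalTier"
        name = "sentences"
        xmin = 0
        xmax = 2.5
        intervals: size = 2
        intervals [1]:
            xmin = 0
            xmax = 1.2
            text = "1"
        intervals [2]:
            xmin = 1.2
            xmax = 2.5
            text = ""
'''

print(read_intervals(SAMPLE))  # [(0.0, 1.2, '1'), (1.2, 2.5, '')]
```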

Troubleshooting

Before executing any code, make sure you have Python 3.6+ installed and a virtual environment activated:

python -m venv librivox_spanish_recreation_env
source librivox_spanish_recreation_env/bin/activate
pip install -r requirements.txt

To install sox, use your system's package manager. On Ubuntu:

sudo apt install sox
sudo apt install libsox-fmt-mp3

Notes

We segment the audio and find silence where there is a punctuation mark.

True positive: a silence coincides with a punctuation mark.
False negative: there is no silence, but there is a punctuation mark.
False positive: there is a silence in the recording, but no punctuation mark.
True negative: there is neither a silence nor a punctuation mark.

Negative: signal. Positive: silence.
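These categories can be counted mechanically once each candidate boundary is described by two flags: whether a silence was found and whether a punctuation mark is present. The sketch below does that, treating every remaining case as a true negative. How boundaries are paired up in evaluate_segmentation is not shown here, so the input shape and the function name are assumptions.

```python
from typing import Dict, Iterable, Tuple


def confusion_counts(pairs: Iterable[Tuple[bool, bool]]) -> Dict[str, int]:
    """Count outcomes for (has_silence, has_punctuation) pairs.

    Positive means silence and negative means signal, as in the notes.
    """
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for has_silence, has_punctuation in pairs:
        if has_silence and has_punctuation:
            counts["TP"] += 1  # silence matches a punctuation mark
        elif has_silence:
            counts["FP"] += 1  # silence without a punctuation mark
        elif has_punctuation:
            counts["FN"] += 1  # punctuation mark without silence
        else:
            counts["TN"] += 1  # neither silence nor punctuation mark
    return counts


observations = [(True, True), (True, False), (False, True), (False, False), (True, True)]
print(confusion_counts(observations))  # {'TP': 2, 'FP': 1, 'FN': 1, 'TN': 1}
```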

Regularize to a 0.01 threshold.

5 micro-corpora of 20 recordings.

Partition at the segment level per corpus, taking those with more than 10.

