Data Preparation
For training a Speech Recognition model using wav2letter++, we typically expect the following inputs:
- Audio and Transcriptions data
- Token dictionary
- Lexicon
- Language Model
wav2letter++ expects audio and transcription data to be prepared in a specific format so that they can be read by the data pipelines. Each dataset (test/valid/train) needs to be in a separate file with one sample per line. A sample is specified using four columns separated by spaces (or tabs):
- sample_id - unique id for the sample
- input_handle - input audio file path
- size - a real number used for sorting the dataset (typically audio duration in milliseconds)
- transcription - target word transcription for this sample
The directory containing the datasets is specified using -datadir, and the list files are specified with -train, -valid and -test, corresponding to the training, validation and test sets.
// Example input file format
[~/speech/data] head train.lst
train001 /tmp/000000000.flac 100.03 this is sparta
train002 /tmp/000000001.flac 360.57 coca cola
train003 /tmp/000000002.flac 123.53 hello world
train004 /tmp/000000003.flac 999.99 quick brown fox jumped
...
...
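As an illustration, a list file like the one above can be generated with a short script. Here is a minimal sketch in Python, assuming the soundfile package for reading audio metadata; the script name, paths and transcriptions are placeholders, not part of wav2letter++:
# make_list.py - hypothetical helper script, not part of wav2letter++
import soundfile as sf

# (audio path, transcription) pairs; in practice these come from your corpus.
samples = [
    ("/tmp/000000000.flac", "this is sparta"),
    ("/tmp/000000001.flac", "coca cola"),
]

with open("train.lst", "w") as out:
    for i, (path, transcription) in enumerate(samples, 1):
        info = sf.info(path)  # audio metadata: frame count, sample rate, ...
        duration_ms = info.frames / info.samplerate * 1000
        out.write("train%03d %s %.2f %s\n" % (i, path, duration_ms, transcription))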
We use sndfile for loading the audio files. It supports many different formats, including .wav, .flac, etc. The default sample rate is 16kHz, but you can specify a different one using the -samplerate flag. Note that, for now, we require all train/valid/test data to have the same sample rate.
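If some of your audio is at a different sample rate, one option is to resample it before generating the list files. A minimal sketch, assuming the soundfile and scipy packages are available (file names are placeholders):
# resample.py - hypothetical helper script, not part of wav2letter++
import soundfile as sf
from scipy.signal import resample_poly

TARGET_SR = 16000  # should match the -samplerate flag

data, sr = sf.read("input.flac")
if sr != TARGET_SR:
    # Polyphase resampling from sr to TARGET_SR along the frame axis.
    data = resample_poly(data, TARGET_SR, sr)
sf.write("input_16khz.flac", data, TARGET_SR)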
A token dictionary file consists of a list of all the subword units (graphemes / phonemes / ...) used. If we are using graphemes, a typical token dictionary file would look like this:
# tokens.txt
|
'
a
b
c
...
... (and so on)
z
The symbol "|" is used to denote a space between words. Note that it is possible to add additional symbols, like N for noise or L for laughter, depending on the dataset. If two tokens appear on the same line in the tokens file, they are mapped to the same index for training/decoding.
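For a grapheme setup, the token set can be collected directly from the training transcriptions. A minimal sketch, assuming the list-file format described above (the script name is a placeholder):
# make_tokens.py - hypothetical helper script, not part of wav2letter++
tokens = set()
with open("train.lst") as f:
    for line in f:
        # Column 4: everything after sample_id, input_handle and size.
        transcription = line.rstrip("\n").split(maxsplit=3)[3]
        tokens.update(transcription.replace(" ", ""))

with open("tokens.txt", "w") as out:
    out.write("|\n")  # the word-separator token
    for token in sorted(tokens):
        out.write(token + "\n")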
A lexicon file consists of mappings from words to their token sequences. Each line contains a word followed by its space-separated sequence of tokens; a space or tab separates the word from its tokens. Here is an example of a grapheme-based lexicon:
# lexicon.txt
a a |
able a b l e |
about a b o u t |
above a b o v e |
...
hello-kitty h e l l o | k i t t y |
...
... (and so on)
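Since each word maps mechanically to its graphemes plus the word separator, such a lexicon can be generated from a word list. A minimal sketch, assuming a plain words.txt with one word per line (both file names are placeholders):
# make_lexicon.py - hypothetical helper script, not part of wav2letter++
with open("words.txt") as f, open("lexicon.txt", "w") as out:
    for line in f:
        word = line.strip()
        if not word:
            continue
        # Hyphens become the word separator, as in the hello-kitty example.
        graphemes = " ".join(word.replace("-", "|"))
        out.write(word + "\t" + graphemes + " |\n")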
If a pre-trained LM is not available for the training data, you can use KenLM to train an N-gram language model. It is also recommended to convert ARPA files to the binary format for faster loading.
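For example, a 4-gram model could be trained and binarized with KenLM's lmplz and build_binary tools, here driven from Python (the corpus and output paths are placeholders):
# make_lm.py - hypothetical wrapper around the KenLM command-line tools
import subprocess

# Train a 4-gram LM from a text corpus, writing an ARPA file.
with open("corpus.txt") as inp, open("lm.arpa", "w") as out:
    subprocess.run(["lmplz", "-o", "4"], stdin=inp, stdout=out, check=True)

# Convert the ARPA file to KenLM's binary format for faster loading.
subprocess.run(["build_binary", "lm.arpa", "lm.bin"], check=True)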
The wav2letter++ decoder is generic enough to plug in N-gram LMs, convolutional LMs, RNN LMs, etc. The latter two will be integrated in a later update.