initial script for automating the creation of a controlled testing environment for OOVs #2057
base: main
create_oovs.sh
Outdated
#!/bin/bash
set -e

stag=1
unused?
create_oovs.sh
Outdated
exit 1
fi

step=1
I guess this shouldn't be hardcoded? Or is it meant as a development tool, so you iterate on the parts as you get them working?
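A minimal sketch of one way to avoid the hardcoded value, assuming the starting step is passed as the first positional argument and defaults to 1. The function name parse_step is hypothetical and not part of the PR.

```shell
#!/bin/bash
# Hypothetical helper (not in the PR): validate the starting step and
# fall back to 1 when no argument is given.
parse_step() {
    local step="${1:-1}"
    case "$step" in
        *[!0-9]*)
            echo "step must be a non-negative integer: $step" >&2
            return 1
            ;;
    esac
    echo "$step"
}

# Example: ./create_oovs.sh 2 would skip step 1.
if step=$(parse_step "${1:-}"); then
    echo "Starting from step $step"
fi
```

This keeps the iterate-on-one-step workflow for development while leaving the default run complete.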
create_oovs.sh
Outdated
echo "Step 1: Preparing Data"
if [ $step -le 1 ]; then

# Extract corpus unique vocabularies
Suggested change:
- # Extract corpus unique vocabularies
+ # Extract corpus vocabulary (unique words)
create_oovs.sh
Outdated
sed 's/ /\n/g' tmp/data.txt | sort | uniq -c | sort -nr > tmp/vocab.txt
grep -o . tmp/vocab.txt | sort -u > tmp/alphabet.txt

# Pick the least frequent 10% vocabularies to represent OOVs
Suggested change:
- # Pick the least frequent 10% vocabularies to represent OOVs
+ # Pick the least frequent 10% words to build OOV set
create_oovs.sh
Outdated
grep -o . tmp/vocab.txt | sort -u > tmp/alphabet.txt

# Pick the least frequent 10% vocabularies to represent OOVs
oov_count=$(wc tmp/vocab.txt | awk '{print int($0*0.1)}')
Use wc -l to communicate intent earlier.
Suggested change:
- oov_count=$(wc tmp/vocab.txt | awk '{print int($0*0.1)}')
+ oov_count=$(wc -l tmp/vocab.txt | awk '{print int($0*0.1)}')
Size of OOV set should be a parameter (with default).
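As a rough sketch of what that could look like, assuming the fraction is passed as an optional second argument with a default of 0.1. The function name count_oovs and its argument order are hypothetical, not part of the PR.

```shell
#!/bin/bash
# Hypothetical helper (not in the PR): compute how many vocabulary entries
# to treat as OOVs, given a vocab file and an optional fraction.
count_oovs() {
    local vocab_file="$1"
    local fraction="${2:-0.1}"   # assumed default: 10% of the vocabulary
    local total
    # Redirecting into wc -l prints only the line count, without the file name.
    total=$(wc -l < "$vocab_file")
    awk -v n="$total" -v f="$fraction" 'BEGIN { print int(n * f) }'
}
```

Called as count_oovs tmp/vocab.txt it uses the default; count_oovs tmp/vocab.txt 0.25 overrides it. Reading the file through a redirect also means awk never sees the file name that wc would otherwise append to its output.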
create_oovs.sh
Outdated
# Prepare OOV csv for testing purposes (to assess imporvements on it)
grep -wFf tmp/oov_sents tmp/data.txt > tmp/oov_corpus.txt
grep -wFf tmp/oov_sents $data | sed '1 i\wav_filename,wav_filesize,transcript' > tmp/oov_corpus.csv
This sed command doesn't work on macOS:
sed: 1: "1 i\wav_filename,wav_fi ...": extra characters after \ at the end of i command
Can we make it portable to BSD sed? This fix worked for me:
Suggested change:
- grep -wFf tmp/oov_sents $data | sed '1 i\wav_filename,wav_filesize,transcript' > tmp/oov_corpus.csv
+ echo "wav_filename,wav_filesize,transcript" > tmp/oov_corpus.csv
+ grep -wFf tmp/oov_sents $data >> tmp/oov_corpus.csv
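One thing to keep in mind with the two-command form: the script runs under set -e, and grep exits with status 1 when nothing matches, so a standalone grep can abort the script where the original pipeline (whose status came from sed) would not. A minimal sketch of a variant that tolerates an empty match, with a hypothetical function name write_oov_csv and hypothetical argument order:

```shell
#!/bin/bash
# Hypothetical helper (not in the PR): write the CSV header followed by the
# rows of the data file that contain any of the patterns as a whole word.
write_oov_csv() {
    local patterns="$1" data="$2" out="$3"
    {
        echo "wav_filename,wav_filesize,transcript"
        # grep exits 1 when no line matches; accept that as an empty result,
        # but still fail on real errors (status 2, e.g. a missing file).
        grep -wFf "$patterns" "$data" || [ $? -eq 1 ]
    } > "$out"
}
```

Grouping both commands under a single redirect also avoids relying on sed insert syntax, which differs between GNU and BSD sed.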
create_oovs.sh
Outdated
fi

# Generate LM
echo "Step 2: Generaing Language Model"
Suggested change:
- echo "Step 2: Generaing Language Model"
+ echo "Step 2: Generating Language Model"
create_oovs.sh
Outdated
gzip -c tmp/scorer_corpus.txt > tmp/scorer_corpus.txt.gz
grep -vf tmp/oov_sents $data > tmp/scorer_corpus.csv

# Prepare OOV csv for testing purposes (to assess imporvements on it)
Suggested change:
- # Prepare OOV csv for testing purposes (to assess imporvements on it)
+ # Prepare OOV CSV for testing purposes (to assess improvements on it)
create_oovs.sh
Outdated
echo "Evaluating on OOV testing set."
python -m coqui_stt_training.evaluate --test_files tmp/oov_corpus.csv \
--test_output_file tmp/results/oov_results.json --scorer_path native_client/kenlm.scorer \
--checkpoint_dir /home/aya/work/tmp/AM/coqui-stt-1.1.0-checkpoint --test_batch_size $nj
Checkpoint path should be made into a parameter.
create_oovs.sh
Outdated
if [ $step -le 3 ]; then
echo "Evaluating on OOV testing set."
python -m coqui_stt_training.evaluate --test_files tmp/oov_corpus.csv \
--test_output_file tmp/results/oov_results.json --scorer_path native_client/kenlm.scorer \
The native_client/kenlm.scorer should be kenlm.scorer, according to the command in the step above, right? And that should probably be changed to tmp/kenlm.scorer to keep all the outputs of the script contained to that folder.
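Taken together with the checkpoint comment, the evaluation step could look roughly like the sketch below. It assumes the checkpoint directory is a required argument and that the output folder defaults to tmp; the function name build_eval_cmd and its argument order are hypothetical. The flags are the ones already used in the PR. The command is printed rather than executed so the sketch runs without the training package installed.

```shell
#!/bin/bash
# Hypothetical helper (not in the PR): assemble the evaluation command from
# parameters instead of hardcoded paths, keeping every output under out_dir.
build_eval_cmd() {
    local checkpoint_dir="${1:-}"
    local out_dir="${2:-tmp}"      # assumed default output folder
    local batch_size="${3:-1}"     # assumed default batch size
    if [ -z "$checkpoint_dir" ]; then
        echo "usage: build_eval_cmd CHECKPOINT_DIR [OUT_DIR] [BATCH_SIZE]" >&2
        return 1
    fi
    echo "python -m coqui_stt_training.evaluate" \
        "--test_files $out_dir/oov_corpus.csv" \
        "--test_output_file $out_dir/results/oov_results.json" \
        "--scorer_path $out_dir/kenlm.scorer" \
        "--checkpoint_dir $checkpoint_dir" \
        "--test_batch_size $batch_size"
}
```

With this shape, build_eval_cmd /path/to/checkpoint prints a command whose scorer path is tmp/kenlm.scorer, and omitting the checkpoint directory fails early with a usage message.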
No description provided.