initial script for automating the creation of a controlled testing en… #2057

Open
wants to merge 8 commits into
base: main

Aya-AlJafari
Contributor

No description provided.

@Aya-AlJafari Aya-AlJafari requested a review from reuben December 28, 2021 15:51
create_oovs.sh Outdated
#!/bin/bash
set -e

stag=1
Collaborator

unused?

create_oovs.sh Outdated
exit 1
fi

step=1
Collaborator

I guess this shouldn't be hardcoded? Or is it meant as a development tool, so you iterate on the parts as you get them working?
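
One way to avoid the hardcoded value, as a minimal sketch: read the starting step from the first positional argument and fall back to 1. The helper name parse_start_step is hypothetical and not part of the PR.

```shell
# Hypothetical helper (not from the PR): print the step to start from.
# Defaults to 1 when no argument is given, so a plain run executes every step.
parse_start_step() {
  local step="${1:-1}"
  case "$step" in
    # Reject anything containing a non-digit before it reaches "[ ... -le ... ]".
    *[!0-9]*) echo "start step must be a non-negative integer" >&2; return 1 ;;
  esac
  echo "$step"
}

# Usage inside the script (assumption: the step is the first argument):
# step=$(parse_start_step "$1")
```

With this, re-running from a later step while iterating would only need a different argument rather than an edit to the script.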

create_oovs.sh Outdated
echo "Step 1: Preparing Data"
if [ $step -le 1 ]; then

# Extract corpus unique vocabularies
Collaborator

Suggested change
- # Extract corpus unique vocabularies
+ # Extract corpus vocabulary (unique words)

create_oovs.sh Outdated
sed 's/ /\n/g' tmp/data.txt | sort | uniq -c | sort -nr > tmp/vocab.txt
grep -o . tmp/vocab.txt | sort -u > tmp/alphabet.txt

# Pick the least frequent 10% vocabularies to represent OOVs
Collaborator

Suggested change
- # Pick the least frequent 10% vocabularies to represent OOVs
+ # Pick the least frequent 10% words to build OOV set

create_oovs.sh Outdated
grep -o . tmp/vocab.txt | sort -u > tmp/alphabet.txt

# Pick the least frequent 10% vocabularies to represent OOVs
oov_count=$(wc tmp/vocab.txt | awk '{print int($0*0.1)}')
Collaborator

Use wc -l to communicate intent earlier.

Suggested change
- oov_count=$(wc tmp/vocab.txt | awk '{print int($0*0.1)}')
+ oov_count=$(wc -l tmp/vocab.txt | awk '{print int($0*0.1)}')

Collaborator

Size of OOV set should be a parameter (with default).
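
A possible shape for this, as a sketch only: take the percentage from an environment variable with a default of 10 and compute the count from the number of lines in the vocabulary file. The names OOV_PERCENT and count_oov are hypothetical, not from the PR.

```shell
# Hypothetical parameter (not from the PR): percentage of the vocabulary
# treated as OOV. Falls back to 10 when OOV_PERCENT is unset or empty.
oov_percent="${OOV_PERCENT:-10}"

# Hypothetical helper: print how many lines of file $1 make up $2 percent,
# rounded down. NR in the END block is the total number of lines read.
count_oov() {
  awk -v pct="$2" 'END { print int(NR * pct / 100) }' "$1"
}

# Usage in the script:
# oov_count=$(count_oov tmp/vocab.txt "$oov_percent")
```

Counting lines inside awk also removes the dependency on the exact output format of wc.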

create_oovs.sh Outdated

# Prepare OOV csv for testing purposes (to assess imporvements on it)
grep -wFf tmp/oov_sents tmp/data.txt > tmp/oov_corpus.txt
grep -wFf tmp/oov_sents $data | sed '1 i\wav_filename,wav_filesize,transcript' > tmp/oov_corpus.csv
Collaborator

This sed command doesn't work on macOS:

sed: 1: "1 i\wav_filename,wav_fi ...": extra characters after \ at the end of i command

Can we make it portable to BSD sed? This fix worked for me:

Suggested change
- grep -wFf tmp/oov_sents $data | sed '1 i\wav_filename,wav_filesize,transcript' > tmp/oov_corpus.csv
+ echo "wav_filename,wav_filesize,transcript" > tmp/oov_corpus.csv
+ grep -wFf tmp/oov_sents $data >> tmp/oov_corpus.csv
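
An equivalent variant, sketched under the assumption that a small helper function is acceptable in the script: emit the header and the filtered rows from one function, so the output file is written with a single redirection and no sed is involved. The name filter_with_header is hypothetical.

```shell
# Hypothetical helper (not from the PR): print the CSV header, then every
# line of the CSV in $2 that contains a whole-word match from patterns file $1.
# It uses the same grep call as the suggestion above and avoids the
# one-line "i\" form that BSD sed rejects.
filter_with_header() {
  printf '%s\n' "wav_filename,wav_filesize,transcript"
  grep -wFf "$1" "$2"
}

# Usage in the script:
# filter_with_header tmp/oov_sents "$data" > tmp/oov_corpus.csv
```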

create_oovs.sh Outdated
fi

# Generate LM
echo "Step 2: Generaing Language Model"
Collaborator

Suggested change
- echo "Step 2: Generaing Language Model"
+ echo "Step 2: Generating Language Model"

create_oovs.sh Outdated
gzip -c tmp/scorer_corpus.txt > tmp/scorer_corpus.txt.gz
grep -vf tmp/oov_sents $data > tmp/scorer_corpus.csv

# Prepare OOV csv for testing purposes (to assess imporvements on it)
Collaborator

Suggested change
- # Prepare OOV csv for testing purposes (to assess imporvements on it)
+ # Prepare OOV CSV for testing purposes (to assess improvements on it)

create_oovs.sh Outdated
echo "Evaluating on OOV testing set."
python -m coqui_stt_training.evaluate --test_files tmp/oov_corpus.csv \
--test_output_file tmp/results/oov_results.json --scorer_path native_client/kenlm.scorer \
--checkpoint_dir /home/aya/work/tmp/AM/coqui-stt-1.1.0-checkpoint --test_batch_size $nj
Collaborator

Checkpoint path should be made into a parameter.
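
A minimal sketch of what that could look like, assuming the path is passed in by the caller and should be validated before the evaluation step starts. The helper name resolve_checkpoint_dir and the argument position are hypothetical.

```shell
# Hypothetical helper (not from the PR): print the checkpoint directory
# given as $1, or fail with a message on stderr if it is missing or invalid.
resolve_checkpoint_dir() {
  if [ -z "${1:-}" ]; then
    echo "error: checkpoint directory is required" >&2
    return 1
  fi
  if [ ! -d "$1" ]; then
    echo "error: not a directory: $1" >&2
    return 1
  fi
  printf '%s\n' "$1"
}

# Usage in the script (assumption: the path is the second positional argument):
# checkpoint_dir=$(resolve_checkpoint_dir "$2")
# The evaluate call would then pass: --checkpoint_dir "$checkpoint_dir"
```

Failing early keeps a wrong path from surfacing only after the data preparation and language model steps have already run.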

create_oovs.sh Outdated
if [ $step -le 3 ]; then
echo "Evaluating on OOV testing set."
python -m coqui_stt_training.evaluate --test_files tmp/oov_corpus.csv \
--test_output_file tmp/results/oov_results.json --scorer_path native_client/kenlm.scorer \
Collaborator

The native_client/kenlm.scorer should be kenlm.scorer, according to the command in the step above, right? And that should probably be changed to tmp/kenlm.scorer to keep all the outputs of the script contained in that folder.
