initial script for automating the creation of a controlled testing en… #2057

Aya-AlJafari · 2021-12-28T15:51:39Z

No description provided.

…vironment for OOVs

reuben · 2021-12-29T12:22:04Z

create_oovs.sh

+#!/bin/bash
+set -e
+
+stag=1


reuben · 2021-12-29T12:22:12Z

create_oovs.sh

+    exit 1
+fi
+
+step=1


I guess this shouldn't be hardcoded? Or is it meant as a development tool, so you iterate on the parts as you get them working?
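
If it is meant for resuming from a later stage, one option would be to read the starting step from the command line and default to 1. A rough sketch, assuming a hypothetical -s flag and a hypothetical parse_args helper (neither is in the PR):

```shell
#!/bin/bash
set -e

# Default: run everything from the first step.
step=1

# Hypothetical helper: parse an optional "-s N" flag that sets the starting step.
parse_args() {
    OPTIND=1  # reset getopts state so the helper can be called more than once
    while getopts "s:" opt; do
        case "$opt" in
            s) step="$OPTARG" ;;
            *) echo "Usage: $0 [-s step]" >&2; return 1 ;;
        esac
    done
}

parse_args "$@"

# A step runs only when the requested starting step is <= its own number,
# so "-s 2" would skip this block.
if [ "$step" -le 1 ]; then
    echo "Step 1: Preparing Data"
fi
```

With something like this, running the script with no arguments behaves as it does now, and -s 2 would pick up at the language model step.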

reuben · 2021-12-29T12:23:25Z

create_oovs.sh

+echo "Step 1: Preparing Data"
+if [ $step -le 1 ]; then
+
+    # Extract corpus unique vocabularies


Suggested change

# Extract corpus unique vocabularies

# Extract corpus vocabulary (unique words)

reuben · 2021-12-29T12:24:49Z

create_oovs.sh

+    sed 's/ /\n/g' tmp/data.txt | sort | uniq -c | sort -nr > tmp/vocab.txt
+    grep -o . tmp/vocab.txt | sort -u > tmp/alphabet.txt
+
+    # Pick the least frequent 10% vocabularies to represent OOVs


Suggested change

# Pick the least frequent 10% vocabularies to represent OOVs

# Pick the least frequent 10% words to build OOV set

reuben · 2021-12-29T12:26:39Z

create_oovs.sh

+    grep -o . tmp/vocab.txt | sort -u > tmp/alphabet.txt
+
+    # Pick the least frequent 10% vocabularies to represent OOVs
+    oov_count=$(wc tmp/vocab.txt | awk '{print int($0*0.1)}')


Use wc -l to communicate intent earlier.

Suggested change

oov_count=$(wc tmp/vocab.txt | awk '{print int($0*0.1)}')

oov_count=$(wc -l tmp/vocab.txt | awk '{print int($0*0.1)}')

Size of OOV set should be a parameter (with default).
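
A possible shape for that, as a sketch only: the helper name oov_count_for, its optional second argument, and the stand-in vocabulary file are all assumptions for illustration; the 0.1 default mirrors the 10% currently hardcoded.

```shell
#!/bin/bash
set -e

# Hypothetical helper: print how many of the least frequent words to hold out.
# $1 = vocabulary file (one entry per line), $2 = fraction, defaulting to 0.1.
oov_count_for() {
    # Reading from stdin makes wc print only the line count, without the file name.
    wc -l < "$1" | awk -v f="${2:-0.1}" '{print int($1 * f)}'
}

# Stand-in vocabulary (10 entries) so the sketch runs on its own.
workdir=$(mktemp -d)
printf 'word%d\n' 1 2 3 4 5 6 7 8 9 10 > "$workdir/vocab.txt"

oov_count=$(oov_count_for "$workdir/vocab.txt")           # default fraction
oov_count_half=$(oov_count_for "$workdir/vocab.txt" 0.5)  # explicit fraction
echo "default: $oov_count, half: $oov_count_half"
```

Passing the fraction through awk -v keeps the shell variable out of the awk program text, so no quoting tricks are needed.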

reuben · 2021-12-29T13:03:31Z

create_oovs.sh

+
+    # Prepare OOV csv for testing purposes (to assess imporvements on it)
+    grep -wFf tmp/oov_sents tmp/data.txt > tmp/oov_corpus.txt
+    grep -wFf tmp/oov_sents $data | sed '1 i\wav_filename,wav_filesize,transcript' > tmp/oov_corpus.csv


This sed command doesn't work on macOS:

sed: 1: "1 i\wav_filename,wav_fi ...": extra characters after \ at the end of i command

Can we make it portable to BSD sed? This fix worked for me:

Suggested change

grep -wFf tmp/oov_sents $data | sed '1 i\wav_filename,wav_filesize,transcript' > tmp/oov_corpus.csv

echo "wav_filename,wav_filesize,transcript" > tmp/oov_corpus.csv

grep -wFf tmp/oov_sents $data >> tmp/oov_corpus.csv
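
An equivalent variant that opens the output file only once is to group the two commands. This is a sketch with stand-in input files (their contents are made up for illustration), not a tested replacement for the script's own paths:

```shell
#!/bin/bash
set -e

workdir=$(mktemp -d)
# Stand-in inputs (hypothetical contents): one OOV word and a two-row data CSV.
printf '%s\n' 'zebra' > "$workdir/oov_sents"
printf '%s\n' 'a.wav,100,the zebra ran' 'b.wav,200,the cat sat' > "$workdir/data.csv"

# Write the header, then the matching rows, through a single redirection.
# This avoids the GNU-only one-line form of sed's "i" command.
{
    echo "wav_filename,wav_filesize,transcript"
    grep -wFf "$workdir/oov_sents" "$workdir/data.csv"
} > "$workdir/oov_corpus.csv"

cat "$workdir/oov_corpus.csv"
```

One thing to keep in mind with either form: grep exits with status 1 when nothing matches, so under set -e the script would stop at this point, whereas the original grep | sed pipeline reported sed's exit status.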

reuben · 2021-12-29T15:45:44Z

create_oovs.sh

+fi
+
+# Generate LM
+echo "Step 2: Generaing Language Model"


Suggested change

echo "Step 2: Generaing Language Model"

echo "Step 2: Generating Language Model"

reuben · 2021-12-29T15:45:58Z

create_oovs.sh

+    gzip -c tmp/scorer_corpus.txt > tmp/scorer_corpus.txt.gz
+    grep -vf tmp/oov_sents $data > tmp/scorer_corpus.csv
+
+    # Prepare OOV csv for testing purposes (to assess imporvements on it)


Suggested change

# Prepare OOV csv for testing purposes (to assess imporvements on it)

# Prepare OOV CSV for testing purposes (to assess improvements on it)

reuben · 2021-12-29T15:47:11Z

create_oovs.sh

+    echo "Evaluating on OOV testing set."
+    python -m coqui_stt_training.evaluate --test_files tmp/oov_corpus.csv \
+        --test_output_file tmp/results/oov_results.json --scorer_path native_client/kenlm.scorer \
+        --checkpoint_dir /home/aya/work/tmp/AM/coqui-stt-1.1.0-checkpoint --test_batch_size $nj


Checkpoint path should be made into a parameter.
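
For instance, the script could validate a user-supplied directory up front and then pass it through a variable. A minimal sketch, assuming a hypothetical require_checkpoint_dir helper and a temporary stand-in directory so it runs on its own:

```shell
#!/bin/bash
set -e

# Hypothetical helper: validate a user-supplied checkpoint directory.
# Prints the directory on success; fails with a message if it is empty or missing.
require_checkpoint_dir() {
    if [ -z "$1" ]; then
        echo "Error: no checkpoint directory given" >&2
        return 1
    fi
    if [ ! -d "$1" ]; then
        echo "Error: checkpoint directory not found: $1" >&2
        return 1
    fi
    printf '%s\n' "$1"
}

# Stand-in directory for the demo; a real run would take the path from an
# argument or flag instead.
demo_dir=$(mktemp -d)
checkpoint_dir=$(require_checkpoint_dir "$demo_dir")

# The evaluation call would then use the variable instead of a fixed path:
#   python -m coqui_stt_training.evaluate ... --checkpoint_dir "$checkpoint_dir" ...
echo "Using checkpoint directory: $checkpoint_dir"
```

Failing early like this also gives a clearer error than letting the evaluation step run into a missing checkpoint.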

reuben · 2021-12-29T15:49:34Z

create_oovs.sh

+if [ $step -le 3 ]; then
+    echo "Evaluating on OOV testing set."
+    python -m coqui_stt_training.evaluate --test_files tmp/oov_corpus.csv \
+        --test_output_file tmp/results/oov_results.json --scorer_path native_client/kenlm.scorer \


The native_client/kenlm.scorer path should be kenlm.scorer, according to the command in the step above, right? It should probably be changed to tmp/kenlm.scorer so that all of the script's outputs stay contained in that folder.

initial script for automating the creation of a controlled testing en…

420f6ac

…vironment for OOVs

Aya-AlJafari requested a review from reuben December 28, 2021 15:51

Fix whitespace

53b60be

reuben reviewed Dec 29, 2021


Aya-AlJafari added 6 commits December 30, 2021 18:56

split the script into two, addressing of feedback comments

0e2c6fd

Merge branch 'main' into feature-1949-scorer-oov

33a3c9c

OOV/space potential fixes

a57add6

adding header files

8ac9840

adding some debug code

de17821

WIP: adding penalty for OOV beams

5e6e86e
