Generate batch of LMs #2249

Closed · wants to merge 10 commits

Conversation

@wasertech (Collaborator) commented Jul 4, 2022

@HarikalarKutusu had made a copy of data/lm/generate_lm.py to create multiple LMs with only one command.

Unfortunately, his implementation was rather lacking, so I made the following changes:

  • added concurrency
  • removed code copied from the original script and made generate_lm_batch.py wrap it instead
  • formatted the code with black
  • added missing dependencies to the Docker image
  • added a CI test (run-ci-lm-gen.sh) to the workflows/build-and-test.yml pipeline

With these changes, you can now do the following:

python data/lm/generate_lm_batch.py \
    --input_txt /mnt/extracted/sources_lm.txt \
    --output_dir /mnt/lm/ \
    --top_k_list 30000-50000 \
    --arpa_order_list "2-3" \
    --max_arpa_memory "85%" \
    --arpa_prune_list "0|0|2-0|0|3" \
    --binary_a_bits 255 \
    --binary_q_bits 8 \
    --binary_type trie \
    --kenlm_bins /code/kenlm/build/bin/ \
    -j 12

This will generate an LM for every possible combination of:

{
    'top_k': [30000, 50000],
    'arpa_order': [2, 3],
    'arpa_prune': ["0|0|2", "0|0|3"]
}
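
For illustration, here is a minimal sketch of how the "-"-separated flag values could expand into the grid above. The helper name and the assumption of a full Cartesian product are mine, not necessarily how generate_lm_batch.py actually does it:

import itertools

def expand_batch_args(top_k_list, arpa_order_list, arpa_prune_list):
    # Hypothetical helper: split the "-"-separated CLI values and build
    # the Cartesian product of all parameter combinations.
    top_ks = [int(v) for v in top_k_list.split("-")]
    orders = [int(v) for v in arpa_order_list.split("-")]
    prunes = arpa_prune_list.split("-")  # "0|0|2-0|0|3" -> ["0|0|2", "0|0|3"]
    for arpa_order, top_k, arpa_prune in itertools.product(orders, top_ks, prunes):
        yield {"arpa_order": arpa_order, "top_k": top_k, "arpa_prune": arpa_prune}

# With the flags from the command above, this yields 2 x 2 x 2 = 8 combinations.
for combo in expand_batch_args("30000-50000", "2-3", "0|0|2-0|0|3"):
    print(combo)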

The generated LMs will be stored in {--output_dir}/{arpa_order}-{top_k}-{arpa_prune}/.

# ./data/lm/4-30000-0|0|1
drwxr-xr-x root root  .
.rw-r--r-- root root lm.binary
.rw-r--r-- root root vocab-30000.txt
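
As a sketch, the directory naming above amounts to something like this (not the script's actual code, just the scheme as I understand it):

import os

def lm_output_dir(output_dir, arpa_order, top_k, arpa_prune):
    # Build the per-combination output directory "{arpa_order}-{top_k}-{arpa_prune}".
    return os.path.join(output_dir, f"{arpa_order}-{top_k}-{arpa_prune}")

print(lm_output_dir("./data/lm", 4, 30000, "0|0|1"))  # ./data/lm/4-30000-0|0|1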

Requires libboost-program-options-dev and libboost-thread-dev to be installed, otherwise lmplz crashes with:

libboost_program_options.so.1.71.0: cannot open shared object file: No such file or directory
libboost_thread.so.1.71.0:  cannot open shared object file: No such file or directory

@wasertech (Collaborator, Author):

Better!

@wasertech (Collaborator, Author) commented Jul 4, 2022

The output is a little messy since the runs execute simultaneously, so we need to report everything nicely at the end.
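
One way to do that would be to collect each run's parameters and duration and only print the summary once every worker has finished. This is a hedged sketch of that pattern, not the batch script's actual implementation (function names are hypothetical):

import concurrent.futures
import logging
import time

def generate_one_lm(params):
    # Hypothetical worker: build one LM for a single parameter combination
    # and return its parameters together with the time it took.
    start = time.perf_counter()
    # ... call the wrapped generate_lm steps for this combination here ...
    return params, time.perf_counter() - start

def run_batch(combinations, jobs):
    # Run all combinations in parallel, then report once at the end instead
    # of relying on the interleaved per-run output.
    with concurrent.futures.ProcessPoolExecutor(max_workers=jobs) as pool:
        results = list(pool.map(generate_one_lm, combinations))
    for params, duration in results:
        logging.info(
            "arpa_order=%s top_k=%s arpa_prune='%s' took %.2f seconds",
            params["arpa_order"], params["top_k"], params["arpa_prune"], duration,
        )
    return results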

@wasertech (Collaborator, Author) commented Jul 4, 2022

root@c53e06a85b12:/code# ./bin/run-ci-lm-gen-batch.sh 
sources_lm_filepath=./data/smoke_test/vocab.txt
+ python data/lm/generate_lm_batch.py --input_txt ./data/smoke_test/vocab.txt --output_dir ./data/lm --top_k_list 30000 --arpa_order_list 4 --max_arpa_memory 85% --arpa_prune_list 0|0|2 --binary_a_bits 255 --binary_q_bits 8 --binary_type trie --kenlm_bins /code/kenlm/build/bin/ -j 1

Converting to lowercase and counting word occurrences ...
| |#                                                                                                                                                             | 500 Elapsed Time: 0:00:00

Saving top 30000 words ...

Calculating word statistics ...
  Your text file has 13343 words in total
  It has 2559 unique words
  Your top-30000 words are 100.0000 percent of all words
  Your most common word "the" occurred 687 times
  The least common word in your top-k is "ultraconservative" with 1 times
  The first word with 2 occurrences is "mens" at place 1146

Creating ARPA file ...
=== 1/5 Counting and sorting n-grams ===
Reading /code/data/lm/4-30000-0|0|2/lower.txt.gz
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Unigram tokens 13343 types 2562
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:30744 2:14627018752 3:27425658880 4:43881058304
Statistics:
1 2562 D1=0.651407 D2=1.09117 D3+=1.64993
2 9399 D1=0.831861 D2=1.21647 D3+=1.44108
3 148/12347 D1=0.937292 D2=1.53845 D3+=1.55801
4 21/12584 D1=0.967272 D2=1.7362 D3+=3
Memory estimate for binary LM:
type     kB
probing 289 assuming -p 1.5
probing 355 assuming -r models -p 1.5
trie    156 without quantization
trie    107 assuming -q 8 -b 8 quantization 
trie    148 assuming -a 22 array pointer compression
trie     99 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1:30744 2:150384 3:2960 4:504
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
####################################################################################################
=== 4/5 Calculating and writing order-interpolated probabilities ===
Chain sizes: 1:30744 2:150384 3:2960 4:504
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
####################################################################################################
=== 5/5 Writing ARPA model ===
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Name:lmplz      VmPeak:84649108 kB      VmRSS:6756 kB   RSSMax:16794516 kB      user:0.940238   sys:4.20439     CPU:5.14465     real:5.14232

Filtering ARPA file using vocabulary of top-k words ...
Reading ./data/lm/4-30000-0|0|2/lm.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************

Building lm.binary ...
Reading ./data/lm/4-30000-0|0|2/lm_filtered.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Identifying n-grams omitted by SRI
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Quantizing
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Writing trie
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
SUCCESS
----------------------------------------------------------------
2022-07-04 13:32 RUNNING 1/1 FOR arpa_order=4 top_k=30000 arpa_prune='0|0|2'
LM generation 1 took: 5.443297207000796 seconds
----------------------------------------------------------------
INFO:root:Took 5.445083366999825 seconds to generate 1 language model.

@wasertech marked this pull request as ready for review on July 4, 2022, 13:18
@wasertech (Collaborator, Author) commented Jul 5, 2022

I'll close this PR as I've just merged it with #2211 inside #2253 to make available_cpu_count() available STT-wide as coqui_stt_training.util.cpu.available_count().
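
For reference, usage would then look something like the following sketch, assuming the module path quoted above is where the helper ends up after the merge:

from coqui_stt_training.util.cpu import available_count

# Default the number of parallel LM-generation jobs to the detected CPU count.
n_jobs = available_count()
print(f"Running with {n_jobs} parallel jobs")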

@wasertech closed this on Jul 5, 2022.