Reduce custom entity parser footprint in training time #804
Initial behavior
When training the NLU on an assistant that has an entity with a large number of values, we noticed that the training time and inference time of the NLU could be majorly impacted. There were two reasons for that:
- `CustomEntityParser`, which uses a `snips_nlu_parser.GazetteerEntityParser` under the hood, was not making use of the `n_gazetteer_stop_words` configuration parameter. Using stop words when matching values with the gazetteer parser can have a dramatic impact on performance.
- Generating variations of the entity values could make their number explode, from 50k initial values to 800k values. Generating a lot of variety in the entity values brings robustness, but it increases both training and inference time. Moreover, generating entity value variations when we already have a lot of values might have a limited effect on robustness.
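To make the blow-up concrete, here is a minimal sketch of how a single value fans out into several gazetteer entries. The variation kinds shown are hypothetical simplifications; the real snips-nlu preprocessing generates more kinds, including number variations through Rustling:

```python
import itertools
import string

def string_variations(value):
    # Hypothetical fan-out of one entity value into surface variants;
    # illustrative only, not the actual snips-nlu variation generators.
    variants = {value, value.lower(), value.title(), value.upper()}
    # Punctuation-stripped variant, e.g. "AC/DC" -> "ACDC"
    variants.add(value.translate(str.maketrans("", "", string.punctuation)))
    return variants

values = ["The Rolling Stones", "AC/DC", "Dr. Dre"]
all_variants = set(itertools.chain.from_iterable(map(string_variations, values)))
print(len(values), "values ->", len(all_variants), "gazetteer entries")
```

Even this toy fan-out multiplies the entry count severalfold; with more variation kinds, tens of thousands of values quickly become hundreds of thousands of entries.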
Work done

When creating the `snips_nlu_parser.GazetteerEntityParser`, we now use `n_gazetteer_stop_words`. We set `n_gazetteer_stop_words = len(entity_voca) * 0.001`, where `len(entity_voca)` is the number of tokens in the entity vocabulary. This ratio was chosen after benchmarking several values across several entity data regimes.
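As a minimal sketch of that formula, assuming `entity_voca` is the set of distinct tokens across the entity values and using naive whitespace tokenization (the actual snips-nlu tokenizer differs):

```python
def compute_n_gazetteer_stop_words(entity_values, ratio=0.001):
    # entity_voca: distinct tokens appearing in the entity values.
    # Whitespace splitting is a simplification of the real tokenizer.
    entity_voca = {token for value in entity_values for token in value.split()}
    # 0.1% of the vocabulary size, the ratio benchmarked in this PR
    return int(len(entity_voca) * ratio)

# An entity vocabulary of 50,000 tokens yields 50 gazetteer stop words.
```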
We also limit the variations we generate depending on the number of entity values (see the sketch after this list):

- below 1000 entity values, we generate all string variations;
- between 1000 and 10000 values, we generate all variations except the number variations (which are the longest to generate, since we have to run Rustling on all entity values);
- above 10000 entity values, we only generate normalization variations.
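The three regimes could look as follows in code; the variation-kind names are hypothetical, and only the thresholds come from this PR:

```python
def variation_kinds(n_entity_values):
    # Pick which variation generators to run, given the entity size.
    # Kind names are illustrative, not the actual snips-nlu internals.
    # Number variations are the most expensive, since Rustling must be
    # run on every entity value.
    if n_entity_values < 1000:
        return {"normalization", "case", "punctuation", "numbers"}
    if n_entity_values < 10000:
        return {"normalization", "case", "punctuation"}  # skip number variations
    return {"normalization"}  # only normalization variations

# The larger the entity, the fewer variation kinds are generated.
assert variation_kinds(500) >= variation_kinds(5000) >= variation_kinds(50000)
```

Keeping the tiers as a single decision point makes the training-time cost predictable: the expensive generators are disabled exactly where they add the least robustness.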