Reduce custom entity parser footprint in training time #804
Initial behavior
When training the NLU on an assistant that has an entity with a large number of values, we noticed that the training time and inference time of the NLU could be majorly impacted. There were two reasons for that:
- `CustomEntityParser`, which uses a `snips_nlu_parser.GazetteerEntityParser` under the hood, was not making use of the `n_gazetteer_stop_words` configuration parameter. Using stop words when matching values with the gazetteer parser can have a dramatic impact on performance.
- Generating variations of the entity values could make their number explode, from 50k initial values to 800k values. Generating a lot of variety in the entity values brings robustness, but it increases both training and inference time. Moreover, generating entity value variations when we already have a lot of values might have a limited effect on robustness.
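To make the blow-up concrete, here is a minimal sketch of how a single value fans out into several gazetteer entries. The variation kinds shown are hypothetical simplifications; the real snips-nlu preprocessing generates more kinds, including number variations through Rustling:

```python
import itertools
import string

def string_variations(value):
    # Hypothetical fan-out of one entity value into surface variants;
    # illustrative only, not the actual snips-nlu variation generators.
    variants = {value, value.lower(), value.title(), value.upper()}
    # Punctuation-stripped variant, e.g. "AC/DC" -> "ACDC"
    variants.add(value.translate(str.maketrans("", "", string.punctuation)))
    return variants

values = ["The Rolling Stones", "AC/DC", "Dr. Dre"]
all_variants = set(itertools.chain.from_iterable(map(string_variations, values)))
print(len(values), "values ->", len(all_variants), "gazetteer entries")
```

Even this toy fan-out multiplies the entry count severalfold; with more variation kinds, tens of thousands of values quickly become hundreds of thousands of entries.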
Work done

When creating the `snips_nlu_parser.GazetteerEntityParser`, we now use `n_gazetteer_stop_words`. We set `n_gazetteer_stop_words = len(entity_voca) * 0.001`, where `len(entity_voca)` is the number of tokens in the entity vocabulary. This ratio was chosen after benchmarking several values across several entity data regimes.
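As a minimal sketch of that formula, assuming `entity_voca` is the set of distinct tokens across the entity values and using naive whitespace tokenization (the actual snips-nlu tokenizer differs):

```python
def compute_n_gazetteer_stop_words(entity_values, ratio=0.001):
    # entity_voca: distinct tokens appearing in the entity values.
    # Whitespace splitting is a simplification of the real tokenizer.
    entity_voca = {token for value in entity_values for token in value.split()}
    # 0.1% of the vocabulary size, the ratio benchmarked in this PR
    return int(len(entity_voca) * ratio)

# An entity vocabulary of 50,000 tokens yields 50 gazetteer stop words.
```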
We also limit the variations we generate depending on the number of entity values (see the sketch after this list):

- below 1000 entity values, we generate all string variations;
- between 1000 and 10000 values, we generate all variations except the number variations (which are the longest to generate, since we have to run Rustling on all entity values);
- above 10000 entity values, we only generate normalization variations.
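The three regimes could look as follows in code; the variation-kind names are hypothetical, and only the thresholds come from this PR:

```python
def variation_kinds(n_entity_values):
    # Pick which variation generators to run, given the entity size.
    # Kind names are illustrative, not the actual snips-nlu internals.
    # Number variations are the most expensive, since Rustling must be
    # run on every entity value.
    if n_entity_values < 1000:
        return {"normalization", "case", "punctuation", "numbers"}
    if n_entity_values < 10000:
        return {"normalization", "case", "punctuation"}  # skip number variations
    return {"normalization"}  # only normalization variations

# The larger the entity, the fewer variation kinds are generated.
assert variation_kinds(500) >= variation_kinds(5000) >= variation_kinds(50000)
```

Keeping the tiers as a single decision point makes the training-time cost predictable: the expensive generators are disabled exactly where they add the least robustness.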