
Reduce custom entity parser footprint in training time #804

Merged
merged 8 commits into develop from task/improve-custom-entity-parser-performances
May 20, 2019

Conversation

ClemDoum
Collaborator

@ClemDoum ClemDoum commented May 20, 2019

Initial behavior

When training the NLU on assistants that have an entity with a large number of values, we noticed that the training time / inference time of the NLU could be severely impacted.

There were 2 reasons for that:

  • first, the CustomEntityParser, which uses a snips_nlu_parser.GazetteerEntityParser under the hood, was not making use of the n_gazetteer_stop_words configuration parameter. Using stop words when matching values with the gazetteer parser can have a dramatic impact on performance
  • secondly, when validating the dataset, we generated way too many variations of the same entity value. On some gazetteers we were going from 50k initial values to 800k values. Generating a lot of variety in the entity values brings robustness but increases both training and inference time. Moreover, generating entity value variations when we already have a lot of values might have a limited effect on robustness

Work done

  • When building the snips_nlu_parser.GazetteerEntityParser we now use n_gazetteer_stop_words. We set n_gazetteer_stop_words = len(entity_voca) * 0.001, where len(entity_voca) is the number of tokens in the entity vocabulary (see the first sketch after this list). This number was chosen after benchmarking several values and several entity data regimes
  • We also now generate string variations differently, with 3 regimes depending on the number of entity values (see the second sketch below):
    • if we have fewer than 1000 entity values, we generate all string variations
    • if we have between 1000 and 10000 values, we generate all variations except the number variations (which are the longest to generate since we have to run Rustling on all entity values)
    • if we have more than 10000 entity values, we only generate normalization variations
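
A minimal sketch of the stop-word count described above, assuming a simple whitespace tokenization; the helper name and the tokenization are illustrative rather than the actual snips-nlu internals, while the 0.001 ratio is the one from this PR:

```python
def get_n_gazetteer_stop_words(entity_values, ratio=0.001):
    """Number of stop words for the gazetteer parser, proportional to the
    number of distinct tokens in the entity vocabulary."""
    entity_voca = {
        token
        for value in entity_values
        for token in value.split()  # simplified whitespace tokenization
    }
    return int(len(entity_voca) * ratio)

# e.g. an entity vocabulary of ~50k tokens yields 50 stop words
```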

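A sketch of the data-regime logic for string variations; the thresholds are the ones listed above, while the variation names other than "numbers" and "normalization" are placeholders, not the library's actual variation kinds:

```python
def select_variations(n_entity_values):
    """Pick which string variations to generate for a custom entity,
    depending on how many values it contains."""
    if n_entity_values < 1000:
        # Small entities: generate every kind of variation.
        return {"normalization", "case", "punctuation", "numbers"}
    if n_entity_values <= 10000:
        # Medium entities: drop the number variations, which are the most
        # expensive since Rustling must run on every entity value.
        return {"normalization", "case", "punctuation"}
    # Large entities: keep only the normalization variations.
    return {"normalization"}
```
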
Checklist:

  • My PR is ready for code review
  • I have added some tests, if applicable, and run the whole test suite, including linting tests
  • I have updated the documentation, if applicable

@ClemDoum ClemDoum requested a review from adrienball May 20, 2019 14:41
@ClemDoum ClemDoum force-pushed the task/improve-custom-entity-parser-performances branch from 14f77e2 to 4fa4a6a on May 20, 2019 14:42
@ClemDoum ClemDoum force-pushed the task/improve-custom-entity-parser-performances branch from 4fa4a6a to ca14459 on May 20, 2019 15:03
@ClemDoum ClemDoum merged commit 976113d into develop May 20, 2019
@ClemDoum ClemDoum deleted the task/improve-custom-entity-parser-performances branch May 20, 2019 15:47
@ClemDoum ClemDoum mentioned this pull request Jun 20, 2019