Improve random state handling #801

ClemDoum · 2019-05-13T14:06:43Z

Description:
Currently:

Due to a scikit-learn bug, intent classification training was not deterministic.
Some data augmentation code also made training non-deterministic.

Done:

Integrated sklearn==0.21, which contains a fix that makes SGDClassifier training deterministic
Moved the NLU random state from the config to the shared resources
Fixed a couple of bugs in data augmentation that made training non-deterministic
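
The second item can be illustrated with a minimal sketch. It is not the code from this PR: the names `Unit`, `build_random_state` and `train_pipeline` are hypothetical, and the standard library's `random.Random` stands in for the NumPy random state used in the project. The idea shown is only that the random state is created once and handed to every unit through shared resources, instead of each unit carrying its own seed in its config.

```python
import random


def build_random_state(seed=None):
    """Return a generator; seed=None keeps the default non-deterministic behavior."""
    return random.Random(seed)


class Unit:
    """Hypothetical processing unit that reads its random state from shared resources."""

    def __init__(self, name, **shared):
        self.name = name
        # The random state comes from the shared resources, not from a config object.
        self.random_state = shared["random_state"]

    def fit(self, data):
        # Stand-in for training: shuffle a copy of the data with the shared generator.
        shuffled = list(data)
        self.random_state.shuffle(shuffled)
        return shuffled


def train_pipeline(seed):
    """Build two units on the same shared random state and 'train' them in order."""
    shared = {"random_state": build_random_state(seed)}
    units = [Unit("intent_classifier", **shared), Unit("slot_filler", **shared)]
    return [unit.fit(range(10)) for unit in units]
```

With a single shared generator, one seed fixes the behavior of the whole pipeline, which is the property the removed per-config `random_seed` arguments were approximating.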

Checklist:

My PR is ready for code review
I have added some tests, if applicable, and run the whole test suite, including linting tests
I have updated the documentation, if applicable

Squashed commits: [58d1612] Remove seed from config and put it in the "shared" for the ProcessingUnit

codecov-io · 2019-05-16T09:15:16Z

Codecov Report

Merging #801 into develop will increase coverage by 0.04%.
The diff coverage is 100%.

@@             Coverage Diff             @@
##           develop     #801      +/-   ##
===========================================
+ Coverage    88.42%   88.47%   +0.04%     
===========================================
  Files           76       76              
  Lines         4571     4571              
  Branches       882      882              
===========================================
+ Hits          4042     4044       +2     
+ Misses         397      395       -2     
  Partials       132      132

adrienball · 2019-05-16T12:21:00Z

snips_nlu/intent_classifier/log_reg_classifier_utils.py

    while True:
        noise_length = int(random_state.normal(mean_length, std_length))
+        i += 1


Unused variable i
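
For context, a sampling loop of this kind needs no counter. The sketch below is an assumption about the loop's purpose (redraw until the sampled length is strictly positive), since the diff excerpt does not show the exit condition. `sample_noise_length` is a hypothetical name, and `random.Random.gauss` stands in for the `random_state.normal` call in the excerpt.

```python
import random


def sample_noise_length(random_state, mean_length, std_length):
    """Draw an integer length from a normal distribution, retrying until it is positive.

    Assumption: rejecting non-positive draws is the intent of the loop; the
    original exit condition is not visible in the reviewed excerpt.
    """
    while True:
        noise_length = int(random_state.gauss(mean_length, std_length))
        if noise_length > 0:
            return noise_length
```

Because every draw goes through the `random_state` argument, two calls sequences started from the same seed return the same lengths.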

adrienball · 2019-05-16T12:25:55Z

snips_nlu/tests/test_crf_slot_filler.py

@@ -35,8 +35,9 @@ def test_should_get_slots(self):
 - make me [number_of_cups:snips/number](five) cups of tea
 - please I want [number_of_cups](two) cups of tea""")
        dataset = Dataset.from_yaml_files("en", [dataset_stream]).json
-        config = CRFSlotFillerConfig(random_seed=42)
+        config = CRFSlotFillerConfig()


The config is not needed anymore here.

adrienball · 2019-05-16T12:26:15Z

snips_nlu/tests/test_crf_slot_filler.py

@@ -101,10 +104,11 @@ def test_should_get_sub_builtin_slots(self):
 - find an activity from [start](6pm) to [end](8pm)
 - Book me a trip from [start](this friday) to [end](next tuesday)""")
        dataset = Dataset.from_yaml_files("en", [dataset_stream]).json
-        config = CRFSlotFillerConfig(random_seed=42)
+        config = CRFSlotFillerConfig()


Same comment

adrienball · 2019-05-16T12:26:17Z

snips_nlu/tests/test_crf_slot_filler.py

@@ -65,9 +66,11 @@ def test_should_get_builtin_slots(self):
 - Can you tell me the weather [datetime] please ?
 - what is the weather forecast [datetime] in [location](paris)""")
        dataset = Dataset.from_yaml_files("en", [dataset_stream]).json
-        config = CRFSlotFillerConfig(random_seed=42)
+        config = CRFSlotFillerConfig()


The config is not needed anymore here.

adrienball · 2019-05-16T12:26:25Z

snips_nlu/tests/test_crf_slot_filler.py

@@ -356,9 +360,10 @@ def test_should_get_slots_after_deserialization(self):
 - i want [number_of_cups] cups of tea please
 - can you prepare [number_of_cups] cups of tea ?""")
        dataset = Dataset.from_yaml_files("en", [dataset_stream]).json
-        config = CRFSlotFillerConfig(random_seed=42)
+        config = CRFSlotFillerConfig()


Same comment

adrienball · 2019-05-16T12:31:17Z

snips_nlu/tests/test_probabilistic_intent_parser.py

-        classifier_config = LogRegIntentClassifierConfig(random_seed=42)
-        slot_filler_config = CRFSlotFillerConfig(random_seed=42)
+        classifier_config = LogRegIntentClassifierConfig()
+        slot_filler_config = CRFSlotFillerConfig()
        parser_config = ProbabilisticIntentParserConfig(


Same comment

adrienball · 2019-05-16T12:31:27Z

snips_nlu/tests/test_probabilistic_intent_parser.py

-        classifier_config = LogRegIntentClassifierConfig(random_seed=42)
-        slot_filler_config = CRFSlotFillerConfig(random_seed=42)
+        classifier_config = LogRegIntentClassifierConfig()
+        slot_filler_config = CRFSlotFillerConfig()
        parser_config = ProbabilisticIntentParserConfig(


Same comment

adrienball · 2019-05-16T12:31:45Z

snips_nlu/tests/test_probabilistic_intent_parser.py

@@ -162,9 +169,12 @@ def test_should_get_intents(self):
 utterances:
  - yili yulu yele""")
        dataset = Dataset.from_yaml_files("en", [dataset_stream]).json
-        classifier_config = LogRegIntentClassifierConfig(random_seed=42)
+        classifier_config = LogRegIntentClassifierConfig()
        parser_config = ProbabilisticIntentParserConfig(classifier_config)


Same comment

adrienball · 2019-05-16T12:31:59Z

snips_nlu/tests/test_probabilistic_intent_parser.py

-                random_seed=seed1),
-            slot_filler_config=CRFSlotFillerConfig(random_seed=seed2)
+            intent_classifier_config=LogRegIntentClassifierConfig(),
+            slot_filler_config=CRFSlotFillerConfig()
        )


Same comment

adrienball · 2019-05-16T12:41:09Z

docs/source/tutorial.rst

+different outputs.
+
+If you want to run training in a reproducible way you can pass a random seed to
+your engine:


I prefer using a more impersonal form in the documentation, but that's just a suggestion. That would be something like:

Reproducible training and testing can be achieved by passing a **random seed** to the engine:

adrienball · 2019-05-16T12:48:16Z

docs/source/tutorial.rst

@@ -174,6 +174,26 @@ the dataset we generated earlier:

    engine.fit(dataset)

+Note that by default, the training of the engine is non-deterministic: if you
+train your NLU twice on the same data and test it on the same input, you'll get
+different outputs.


I would be a bit more optimistic in the formulation:

Note that, by default, the training of the NLU engine is a non-deterministic process: training and testing multiple times on the same data may produce different outputs.
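
The behavior described in this note can be checked with a small reproducibility test. The sketch below uses a toy trainer rather than the NLU engine: `train_toy_model` and `DATA` are hypothetical, and the shuffle plus random initialization only mimic the sources of randomness in a real training run.

```python
import random

# Hypothetical toy dataset of (input, target) pairs.
DATA = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]


def train_toy_model(examples, seed=None):
    """Toy stand-in for training: one SGD-like pass over shuffled examples."""
    rng = random.Random(seed)  # seed=None means a different result on each run
    order = list(examples)
    rng.shuffle(order)
    weight = rng.uniform(-1.0, 1.0)  # random initialization
    for x, y in order:
        weight += 0.1 * (y - weight * x) * x
    return weight
```

Training twice with the same seed gives the same weight, so a test can compare the two results for exact equality; without a seed, no such guarantee holds.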

ClemDoum added 9 commits March 27, 2019 16:11

Fix (+1 squashed commit)

2e8863c

Squashed commits: [58d1612] Remove seed from config and put it in the "shared" for the ProcessingUnit

save work

fa69243

Fix setup

16b910c

Merge remote-tracking branch 'origin/develop' into task/improve-rando…

d647793

…m-seed

Bump sklearn

2fee382

Fix setup

2257f3a

Test deterministic behavior

2215f52

Linting

9462eca

Doc

6ff0eea

ClemDoum force-pushed the task/improve-random-seed branch from e29cc06 to 6ff0eea Compare May 13, 2019 14:07

Fix tests

89f064c

Fix Python2.7 tests

9e43dfd

ClemDoum force-pushed the task/improve-random-seed branch from f956a78 to 9e43dfd Compare May 16, 2019 09:21

Linting

c3383a6

ClemDoum requested a review from adrienball May 16, 2019 09:47

adrienball requested changes May 16, 2019

adrienball reviewed May 16, 2019

Fixes after review

45f4fd4

ClemDoum force-pushed the task/improve-random-seed branch from ae0633e to 45f4fd4 Compare May 20, 2019 14:03

ClemDoum requested a review from adrienball May 20, 2019 14:05

adrienball approved these changes May 20, 2019

ClemDoum merged commit 6416624 into develop May 20, 2019

ClemDoum deleted the task/improve-random-seed branch May 20, 2019 14:15

adrienball mentioned this pull request Jun 11, 2019

Deployed assistant different accuracy/behavior from online assistant results snipsco/snips-issues#154

Closed

ClemDoum mentioned this pull request Jun 20, 2019

Release/0.19.7 #813

Merged

adrienball mentioned this pull request Jun 24, 2019

Random seeds and deterministic trainings #779

Closed
