Improve random state handling #801

Merged
merged 13 commits into from
May 20, 2019

Collaborator

@ClemDoum ClemDoum commented May 13, 2019

Description:
Currently:

  • Due to a scikit-learn bug, the intent classification training was not deterministic
  • Some data augmentation code also made the training non-deterministic

Done:

  • Integrated sklearn==0.21, which includes a fix that makes SGDClassifier training deterministic
  • Moved the NLU random state from the config to the shared resources
  • Fixed a couple of bugs in data augmentation that made the training non-deterministic
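
As a rough illustration of the second item, the sketch below keeps a single random state in a shared resources dict instead of a `random_seed` field on each unit's config. Everything here is hypothetical: `build_shared_resources`, `ToyIntentClassifier`, `ToySlotFiller` and `train` are illustrative names, not the project's API, and the standard library's `random.Random` stands in for the random state object used in the PR.

```python
import random


def build_shared_resources(seed=None):
    """Create the resources dict shared by every processing unit.

    Hypothetical helper: one random state lives here instead of a
    `random_seed` field on each unit's config.
    """
    return {"random_state": random.Random(seed)}


class ToyIntentClassifier:
    """Toy stand-in for a unit with a randomized training step."""

    def __init__(self, shared):
        self.random_state = shared["random_state"]

    def fit(self, utterances):
        # Shuffle a copy so the caller's list is left untouched.
        shuffled = list(utterances)
        self.random_state.shuffle(shuffled)
        return shuffled


class ToySlotFiller:
    """Second toy unit drawing from the same shared random state."""

    def __init__(self, shared):
        self.random_state = shared["random_state"]

    def fit(self, utterances):
        # Pick one utterance as a stand-in for a randomized step.
        return self.random_state.choice(utterances)


def train(seed, utterances):
    """Train both toy units from a single seed."""
    shared = build_shared_resources(seed)
    classifier = ToyIntentClassifier(shared)
    slot_filler = ToySlotFiller(shared)
    return classifier.fit(utterances), slot_filler.fit(utterances)
```

With this layout, calling `train(42, data)` twice should give identical results, and no unit config needs to carry its own seed.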

Checklist:

  • My PR is ready for code review
  • I have added some tests, if applicable, and run the whole test suite, including linting tests
  • I have updated the documentation, if applicable

@ClemDoum ClemDoum force-pushed the task/improve-random-seed branch from e29cc06 to 6ff0eea Compare May 13, 2019 14:07

codecov-io commented May 16, 2019

Codecov Report

Merging #801 into develop will increase coverage by 0.04%.
The diff coverage is 100%.

@@             Coverage Diff             @@
##           develop     #801      +/-   ##
===========================================
+ Coverage    88.42%   88.47%   +0.04%     
===========================================
  Files           76       76              
  Lines         4571     4571              
  Branches       882      882              
===========================================
+ Hits          4042     4044       +2     
+ Misses         397      395       -2     
  Partials       132      132

@ClemDoum ClemDoum force-pushed the task/improve-random-seed branch from f956a78 to 9e43dfd Compare May 16, 2019 09:21
@ClemDoum ClemDoum requested a review from adrienball May 16, 2019 09:47
while True:
    noise_length = int(random_state.normal(mean_length, std_length))
    i += 1
Contributor

Unused variable i
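
For reference, a minimal sketch of such a sampling loop without the counter. The exit condition is not visible in the excerpt above, so stopping at the first strictly positive draw is an assumption, and `sample_noise_length` is a hypothetical name. The standard library's `random.Random.gauss` stands in for the `random_state.normal` call shown in the diff.

```python
import random


def sample_noise_length(random_state, mean_length, std_length):
    """Draw a strictly positive integer length from a normal distribution.

    Assumption: the loop exits on the first positive draw; the real exit
    condition is not shown in the reviewed excerpt.
    """
    while True:
        # int() truncates the float draw toward zero.
        noise_length = int(random_state.gauss(mean_length, std_length))
        if noise_length > 0:
            return noise_length
```

Because the only source of randomness is the `random_state` argument, two generators seeded identically yield the same sequence of lengths.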

@@ -35,8 +35,9 @@ def test_should_get_slots(self):
 - make me [number_of_cups:snips/number](five) cups of tea
 - please I want [number_of_cups](two) cups of tea""")
 dataset = Dataset.from_yaml_files("en", [dataset_stream]).json
-config = CRFSlotFillerConfig(random_seed=42)
+config = CRFSlotFillerConfig()
Contributor

The config is not needed anymore here.

@@ -101,10 +104,11 @@ def test_should_get_sub_builtin_slots(self):
 - find an activity from [start](6pm) to [end](8pm)
 - Book me a trip from [start](this friday) to [end](next tuesday)""")
 dataset = Dataset.from_yaml_files("en", [dataset_stream]).json
-config = CRFSlotFillerConfig(random_seed=42)
+config = CRFSlotFillerConfig()
Contributor

Same comment

@@ -65,9 +66,11 @@ def test_should_get_builtin_slots(self):
 - Can you tell me the weather [datetime] please ?
 - what is the weather forecast [datetime] in [location](paris)""")
 dataset = Dataset.from_yaml_files("en", [dataset_stream]).json
-config = CRFSlotFillerConfig(random_seed=42)
+config = CRFSlotFillerConfig()
Contributor

The config is not needed anymore here.

@@ -356,9 +360,10 @@ def test_should_get_slots_after_deserialization(self):
 - i want [number_of_cups] cups of tea please
 - can you prepare [number_of_cups] cups of tea ?""")
 dataset = Dataset.from_yaml_files("en", [dataset_stream]).json
-config = CRFSlotFillerConfig(random_seed=42)
+config = CRFSlotFillerConfig()
Contributor

Same comment

-classifier_config = LogRegIntentClassifierConfig(random_seed=42)
-slot_filler_config = CRFSlotFillerConfig(random_seed=42)
+classifier_config = LogRegIntentClassifierConfig()
+slot_filler_config = CRFSlotFillerConfig()
 parser_config = ProbabilisticIntentParserConfig(
Contributor

Same comment

-classifier_config = LogRegIntentClassifierConfig(random_seed=42)
-slot_filler_config = CRFSlotFillerConfig(random_seed=42)
+classifier_config = LogRegIntentClassifierConfig()
+slot_filler_config = CRFSlotFillerConfig()
 parser_config = ProbabilisticIntentParserConfig(
Contributor

Same comment

@@ -162,9 +169,12 @@ def test_should_get_intents(self):
 utterances:
 - yili yulu yele""")
 dataset = Dataset.from_yaml_files("en", [dataset_stream]).json
-classifier_config = LogRegIntentClassifierConfig(random_seed=42)
+classifier_config = LogRegIntentClassifierConfig()
 parser_config = ProbabilisticIntentParserConfig(classifier_config)
Contributor

Same comment

-random_seed=seed1),
-slot_filler_config=CRFSlotFillerConfig(random_seed=seed2)
+intent_classifier_config=LogRegIntentClassifierConfig(),
+slot_filler_config=CRFSlotFillerConfig()
 )
Contributor

Same comment

different outputs.

If you want to run training in a reproducible way you can pass a random seed to
your engine:
Contributor

I prefer using a more impersonal form in the documentation, but that's just a suggestion. That would be something like:

Reproducible training and testing can be achieved by passing a 
**random seed** to the engine:

@@ -174,6 +174,26 @@ the dataset we generated earlier:

engine.fit(dataset)

Note that by default, the training of the engine is non-deterministic: if you
train your NLU twice on the same data and test it on the same input, you'll get
different outputs.
Contributor

@adrienball adrienball May 16, 2019

I would be a bit more optimistic in the formulation:

Note that, by default, the training of the NLU engine is a non-deterministic process: 
training and testing multiple times on the same data may produce different outputs.
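
The seeded versus unseeded behavior described in these documentation snippets can be sketched with a small normalization helper. This is a hypothetical stand-in written against the standard library's `random` module and loosely modeled on scikit-learn's `check_random_state` utility; it is not the code from this PR.

```python
import random


def check_random_state(seed=None):
    """Normalize `seed` into a `random.Random` instance.

    - None: a fresh generator seeded by the OS (non-deterministic default)
    - int: a generator seeded with that value (reproducible)
    - random.Random: returned unchanged, so callers can share one state
    """
    if seed is None:
        return random.Random()
    if isinstance(seed, random.Random):
        return seed
    if isinstance(seed, int):
        return random.Random(seed)
    raise ValueError("%r cannot be used to seed a random state" % (seed,))
```

Two generators built from the same integer seed produce the same draws, whereas leaving the seed at `None` gives a differently seeded generator on each call, which matches the non-deterministic default the documentation warns about.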

@ClemDoum ClemDoum force-pushed the task/improve-random-seed branch from ae0633e to 45f4fd4 Compare May 20, 2019 14:03
@ClemDoum ClemDoum requested a review from adrienball May 20, 2019 14:05
3 participants