-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
Implement a new Training step that replaces Model Exploration step 3 #101
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This is not hooked up and polished yet. The new step will save metadata about the model trained in the previous step for inspection and debugging.
… to true in the config file
One of these tests is failing because LinkStepSaveModelMetadata is erroring out when it's not skipped.
It is not serialized to the filesystem anymore. This does mean that it doesn't persist between runs of hlink, so I added a note to that effect.
This seems to be working nicely. There's still a little bit to iron out about feature importances vs. coefficients for different models.
We've done what the TODO comment here says and created a step in training that does this for us. So we can get rid of this method to keep things organized.
We could deprecate this and treat it as a breaking change, but I see two reasons to just go ahead and remove this as a bug fix: 1. This step is undocumented. 2. This step is really buggy and does not work at all. Users really should not be using it. Now training step 3 does what this step was supposed to do.
Add a doc comment and remove some unhelpful code comments.
I've documented that this step is available only for Training, not for Household Training at the moment. We can easily add this functionality to Household Training as well if it would be useful.
We need to access training.feature_importances, not a top-level feature_importances attribute.
Add testing
# for free
to join this conversation on GitHub.
Already have an account?
# to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Closes #60. Closes #29. Closes #30.
This PR adds a fourth step to the Training step - "save model metadata". This new step replaces the old Model Exploration step 3, which was buggy and undocumented.
The new Training step saves either coefficients or feature importances for the trained model to the
training_feature_importances
Spark table. It directly uses the machine learning model trained in the previous step. The output table is helpful for introspection of the model and understanding which features are most important for classification in the Matching task. The Training step does not generate features for any of the potential matches or do any classification itself. Those are jobs done by the Matching task.By default Training step 3 is skipped. To enable it, users can set the
training.feature_importances
attribute to true in their config file.Since Model Exploration was intended to serve this purpose but was not doing so, we are treating this as a bug fix instead of a breaking change. This change can go out in hlink 3.5.1.