Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Implement a new Training step that replaces Model Exploration step 3 #101

Merged
merged 19 commits into from
Oct 23, 2023

Conversation

riley-harper
Copy link
Contributor

Closes #60. Closes #29. Closes #30.

This PR adds a fourth step to the Training step - "save model metadata". This new step replaces the old Model Exploration step 3, which was buggy and undocumented.

The new Training step saves either coefficients or feature importances for the trained model to the training_feature_importances Spark table. It directly uses the machine learning model trained in the previous step. The output table is helpful for introspection of the model and understanding which features are most important for classification in the Matching task. The Training step does not generate features for any of the potential matches or do any classification itself. Those are jobs done by the Matching task.

By default Training step 3 is skipped. To enable it, users can set the training.feature_importances attribute to true in their config file.

Since Model Exploration was intended to serve this purpose but was not doing so, we are treating this as a bug fix instead of a breaking change. This change can go out in hlink 3.5.1.

riley-harper and others added 19 commits October 18, 2023 18:09
This is not hooked up and polished yet. The new step will save metadata
about the model trained in the previous step for inspection and debugging.
One of these tests is failing because LinkStepSaveModelMetadata is erroring out
when it's not skipped.
It is not serialized to the filesystem anymore. This does mean that it
doesn't persist between runs of hlink, so I added a note to that effect.
This seems to be working nicely. There's still a little bit to iron out about
feature importances vs. coefficients for different models.
We've done what the TODO comment here says and created a step in training that
does this for us. So we can get rid of this method to keep things organized.
We could deprecate this and treat it as a breaking change, but I see two reasons
to just go ahead and remove this as a bug fix:

1. This step is undocumented.
2. This step is really buggy and does not work at all. Users really should not be using
   it. Now training step 3 does what this step was supposed to do.
Add a doc comment and remove some unhelpful code comments.
I've documented that this step is available only for Training, not for Household
Training at the moment. We can easily add this functionality to Household Training
as well if it would be useful.
We need to access training.feature_importances, not a top-level feature_importances
attribute.
@riley-harper riley-harper marked this pull request as ready for review October 23, 2023 19:00
@riley-harper riley-harper merged commit 74d0dc6 into main Oct 23, 2023
@riley-harper riley-harper deleted the fix_get_feature_importances branch October 23, 2023 19:27
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
1 participant