Implement a new Training step that replaces Model Exploration step 3 #101

riley-harper · 2023-10-23T18:17:44Z

Closes #60. Closes #29. Closes #30.

This PR adds a fourth step to the Training task: "save model metadata". This new step replaces the old Model Exploration step 3, which was buggy and undocumented.

The new Training step saves either coefficients or feature importances for the trained model to the training_feature_importances Spark table. It directly uses the machine learning model trained in the previous step. The output table is helpful for introspection of the model and understanding which features are most important for classification in the Matching task. The Training step does not generate features for any of the potential matches or do any classification itself. Those are jobs done by the Matching task.
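
As a rough illustration of the "coefficients or feature importances" choice described above, here is a minimal sketch in plain Python. It is not the code in this PR. The stand-in model classes, the `model_metadata_rows` helper, the `feature_name` key, and the descending sort order are all assumptions made for the example. The `coefficients` and `featureImportances` attribute names mirror what PySpark ML linear and tree-based models expose, and `coefficient_or_importance` is the column name used in this PR.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Union


@dataclass
class FakeLinearModel:
    """Hypothetical stand-in for a model that exposes coefficients."""
    coefficients: Sequence[float]


@dataclass
class FakeTreeModel:
    """Hypothetical stand-in for a model that exposes feature importances."""
    featureImportances: Sequence[float]


def model_metadata_rows(
    model: object, feature_names: Sequence[str]
) -> List[Dict[str, Union[str, float]]]:
    """Build one row per feature from whichever attribute the model has.

    Hypothetical helper for illustration only; not part of hlink.
    """
    if hasattr(model, "coefficients"):
        values = [float(v) for v in model.coefficients]
    elif hasattr(model, "featureImportances"):
        values = [float(v) for v in model.featureImportances]
    else:
        raise ValueError("model has neither coefficients nor feature importances")

    if len(values) != len(feature_names):
        raise ValueError("expected exactly one value per feature name")

    rows = [
        {"feature_name": name, "coefficient_or_importance": value}
        for name, value in zip(feature_names, values)
    ]
    # Assumption: largest values first. The sort direction used by the
    # real step is not stated in this PR.
    rows.sort(key=lambda row: row["coefficient_or_importance"], reverse=True)
    return rows


tree_rows = model_metadata_rows(FakeTreeModel([0.2, 0.7, 0.1]), ["a", "b", "c"])
linear_rows = model_metadata_rows(FakeLinearModel([-1.5, 0.5]), ["x", "y"])
```

In the real step the rows would presumably be written to the training_feature_importances Spark table rather than returned as a list.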

By default Training step 3 is skipped. To enable it, users can set the training.feature_importances attribute to true in their config file.
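
The opt-in check can be pictured with the short sketch below. It assumes the config file has already been parsed into a nested dictionary. The `should_save_model_metadata` helper is hypothetical and is not hlink's actual implementation. Only the `training.feature_importances` attribute name comes from this PR.

```python
# Assumed shape of the relevant part of a TOML config file:
#
#   [training]
#   feature_importances = true


def should_save_model_metadata(config: dict) -> bool:
    """Return True only when training.feature_importances is set to true.

    Hypothetical helper for illustration; a missing key means "skip".
    """
    training_section = config.get("training", {})
    return bool(training_section.get("feature_importances", False))


enabled = {"training": {"feature_importances": True}}
default = {"training": {}}
# A top-level key is not consulted; the flag must live under "training".
misplaced = {"feature_importances": True}

results = [
    should_save_model_metadata(enabled),
    should_save_model_metadata(default),
    should_save_model_metadata(misplaced),
]
```

Reading the flag from the `training` section rather than the top level matches the fix described later in this PR's commit history.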

Since Model Exploration was intended to serve this purpose but was not doing so, we are treating this as a bug fix instead of a breaking change. This change can go out in hlink 3.5.1.

riley-harper and others added 19 commits October 18, 2023 18:09

[#60] Copy Model Exploration step 3 to a new Training step 3

d22024f

This is not hooked up and polished yet. The new step will save metadata about the model trained in the previous step for inspection and debugging.

[#60] Hook up the new save model metadata training step 3

8622b44

[#60] Skip training step 3 unless training.feature_importances is set to true in the config file

d26daa4

[#60] Add some tests for training step 3

2f3144e

One of these tests is failing because LinkStepSaveModelMetadata is erroring out when it's not skipped.

[#60] Remove some now-redundant if statements from training step 3

1ccc9f7

[#60] Read the model from the trained_models dict

0aedb7e

It is not serialized to the filesystem anymore. This does mean that it doesn't persist between runs of hlink, so I added a note to that effect.

[#60] Implement training step 3 to get feature importances

d85d28f

This seems to be working nicely. There's still a little bit to iron out about feature importances vs. coefficients for different models.

[#60] Remove an unused private method from matching's score step

3906b58

We've done what the TODO comment here says and created a step in training that does this for us. So we can get rid of this method to keep things organized.

[#60] Remove an unused import

53be2b9

[#60] Improve documentation for training save model metadata step

ccf60c4

Add a doc comment and remove some unhelpful code comments.

[#60] Rename training_feature_importances column to coefficient_or_importance

8edbd88

[#60] Sort training_feature_importances on the new column name

47347b4

test commit

879fca0

[#60] Update documentation for the new training step 3

d793638

I've documented that this step is available only for Training, not for Household Training at the moment. We can easily add this functionality to Household Training as well if it would be useful.

[#60] Fix a bug that appeared in two training step 3 tests

d59f1ea

We need to access training.feature_importances, not a top-level feature_importances attribute.

added some additional testing locking in what's currently there

28e37d0

blacked

5432bf8

Merge pull request #102 from jrbalch543/add_testing

440cdff

Add testing

riley-harper marked this pull request as ready for review October 23, 2023 19:00

riley-harper merged commit 74d0dc6 into main Oct 23, 2023

riley-harper deleted the fix_get_feature_importances branch October 23, 2023 19:27

riley-harper mentioned this pull request Oct 30, 2023

Document Model Exploration step 3 #25

Closed
