Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

fix: node_sample_weight array dtype with xgboost explainer #52

Merged
merged 5 commits into from
May 22, 2023

Conversation

thatlittleboy
Copy link
Collaborator

Resolves #48 .

Changes

Please see individual commits in the PR for the changelog.

  • The main fix involves specifying the dtype of the node_sample_weight numpy array to be double type. This makes it compatible with the type declared in the cext (_cext.cc). Previously, the array was declared to be float32 in Python, and double / float64 in the C extension (_cext_dense_tree_update_weights C function). When the dtypes don't match up, it causes the C extension to not be able to properly pass the updated weights array back to Python..
  • Updated the test_provided_background_tree_path_dependent test to use the adult() dataset instead of iris(), and under-specify the model (drastically lower the xgboost depth and num_boost_rounds, otherwise we end up with too many leaves/terminal nodes in the model and there'll be some nodes with 0 weights).
  • Some minor docs and whitespace fixes.

Some history

The actual manifestation of this problem comes from the test_provided_background_tree_path_dependent test failing.
slundberg tried to resolve this problem in PR slundberg#2781, but didn't work.

The actual root cause is what is presented in #48 , that no matter what background dataset was provided, the node_sample_weight array would still always end up as an array of 0's, thus failing the test.

The dtype here needs to match with the dtype declared in the _cext.cc file, which is
NPY_DOUBLE.
because we need a binary label. Iris dataset is not binary
@thatlittleboy thatlittleboy changed the title fix: node_sample_weight array dtype with xgboost booster fix: node_sample_weight array dtype with xgboost explainer May 19, 2023
@thatlittleboy thatlittleboy added the bug Something isn't working label May 21, 2023
Copy link
Collaborator

@connortann connortann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me! Great that the tests are nearly passing now.

Does this have any change to the public API - might it change the behaviour of the TreeExplainer? If so, it would be good to ensure this is documented, probably in the Changelog in #59

@thatlittleboy
Copy link
Collaborator Author

thatlittleboy commented May 22, 2023

Yes, it does in some way (albeit a very narrow possibility). It only affects TreeExplainer, when explaining xgboost models, when using tree_path_dependent feature_perturbation and yet the user passes in background data anyway (which is not required when using tree path dependent).
You may refer to the failing test from before for an example.

But in such a scenario, the shap_values should be calculated without error now (previously would raise assertion error)

@connortann connortann merged commit f72ff6e into master May 22, 2023
@connortann connortann deleted the fix-tree-path-dependent branch May 22, 2023 13:00
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Node sample weights are not being updated in xgboost tree_path_dependent
2 participants