ENH compute_marginal and plot_marginal (#162)
* ENH add private _compute_marginal

* ENH add plot_marginal

* TST add more tests

* ENH make compute_marginal public API

- X, feature_name instead of feature in compute_marginal and plot_marginal

* ENH add private array / dataframe utilities

* ENH add private compute_partial_dependence

* ENH add partial dependence to compute_marginal

* ENH add compute_marginal

* ENH add uniform binning and improve plot_marginal

* TST more tests for compute_marginal and plot_marginal

* ENH add feature std to bin_edges array

* ENH avoid pandas deprecation warning when assigning

* DOC add plot_marginal to regression example and to README

* FIX label in plotly add_scatter

* FIX X=None, typing headaches, docstring and cleanup
lorentzenchr authored Aug 8, 2024
1 parent face270 commit 2c960f2
Showing 15 changed files with 2,375 additions and 65 deletions.
3 changes: 2 additions & 1 deletion README.md
@@ -12,10 +12,11 @@
 Highlights:
 
 - All common point predictions covered: mean, median, quantiles, expectiles.
-- Assess model calibration with [identification functions](https://lorentzenchr.github.io/model-diagnostics/reference/model_diagnostics/calibration/identification/#model_diagnostics.calibration.identification.identification_function) (generalized residuals) and [compute_bias](https://lorentzenchr.github.io/model-diagnostics/reference/model_diagnostics/calibration/identification/#model_diagnostics.calibration.identification.compute_bias).
+- Assess model calibration with [identification functions](https://lorentzenchr.github.io/model-diagnostics/reference/model_diagnostics/calibration/identification/#model_diagnostics.calibration.identification.identification_function) (generalized residuals), [compute_bias](https://lorentzenchr.github.io/model-diagnostics/reference/model_diagnostics/calibration/identification/#model_diagnostics.calibration.identification.compute_bias) and [compute_marginal](https://lorentzenchr.github.io/model-diagnostics/reference/model_diagnostics/calibration/identification/#model_diagnostics.calibration.identification.compute_marginal).
 - Assess calibration and bias graphically
   - [reliability diagrams](https://lorentzenchr.github.io/model-diagnostics/reference/model_diagnostics/calibration/plots/#model_diagnostics.calibration.plots.plot_reliability_diagram) for auto-calibration
   - [bias plots](https://lorentzenchr.github.io/model-diagnostics/reference/model_diagnostics/calibration/plots/#model_diagnostics.calibration.plots.plot_bias) for conditional calibration
+  - [marginal plots](https://lorentzenchr.github.io/model-diagnostics/reference/model_diagnostics/calibration/plots/#model_diagnostics.calibration.plots.plot_marginal) for average `y_obs`, `y_pred` and partial dependence for one feature
 - Assess the predictive performance of models
   - strictly consistent, homogeneous [scoring functions](https://lorentzenchr.github.io/model-diagnostics/reference/model_diagnostics/scoring/scoring/)
   - [score decomposition](https://lorentzenchr.github.io/model-diagnostics/reference/model_diagnostics/scoring/scoring/#model_diagnostics.scoring.scoring.decompose) into miscalibration, discrimination and uncertainty
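
In use, the new API might look like the following minimal sketch. It is illustrative and not part of the diff: the data are synthetic, `X` and `feature_name` are confirmed by the commit message, `y_obs`/`y_pred` mirror the existing `compute_bias` signature, and the `predict_function` parameter for the partial-dependence curve is an assumption.

import numpy as np
import polars as pl
from model_diagnostics.calibration import compute_marginal, plot_marginal

rng = np.random.default_rng(42)
X = pl.DataFrame({"age": rng.uniform(18, 65, size=1000)})

def predict(X):
    # Stand-in for a fitted model's predict method.
    return 2.0 * X.get_column("age").to_numpy()

y_pred = predict(X)
y_obs = y_pred + rng.normal(scale=5.0, size=1000)

# Average y_obs and y_pred per (binned) value of the feature "age" ...
marginal = compute_marginal(y_obs=y_obs, y_pred=y_pred, X=X, feature_name="age")
print(marginal)

# ... and the same as a plot, including the model's partial dependence.
plot_marginal(
    y_obs=y_obs,
    y_pred=y_pred,
    X=X,
    feature_name="age",
    predict_function=predict,  # parameter name assumed
)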
244 changes: 237 additions & 7 deletions docs/examples/regression_on_workers_compensation.ipynb

Large diffs are not rendered by default.

3 changes: 2 additions & 1 deletion docs/index.md
@@ -12,10 +12,11 @@
 Highlights:
 
 - All common point predictions covered: mean, median, quantiles, expectiles.
-- Assess model calibration with [identification functions][model_diagnostics.calibration.identification.identification_function] (generalized residuals) and [compute_bias][model_diagnostics.calibration.identification.compute_bias].
+- Assess model calibration with [identification functions][model_diagnostics.calibration.identification.identification_function] (generalized residuals), [compute_bias][model_diagnostics.calibration.identification.compute_bias] and [compute_marginal][model_diagnostics.calibration.identification.compute_marginal].
 - Assess calibration and bias graphically
   - [reliability diagrams][model_diagnostics.calibration.plots.plot_reliability_diagram] for auto-calibration
   - [bias plots][model_diagnostics.calibration.plots.plot_bias] for conditional calibration
+  - [marginal plots][model_diagnostics.calibration.plots.plot_marginal] for average `y_obs`, `y_pred` and partial dependence for one feature
 - Assess the predictive performance of models
   - strictly consistent, homogeneous [scoring functions][model_diagnostics.scoring.scoring]
   - [score decomposition][model_diagnostics.scoring.decompose] into miscalibration, discrimination and uncertainty
3 changes: 2 additions & 1 deletion mkdocs.yml
@@ -85,7 +85,8 @@ plugins:
       # Other
       show_bases: false
       show_source: true
-  - mkdocs-jupyter
+  - mkdocs-jupyter:
+      include_requirejs: true  # needed for plotly
 
 markdown_extensions:
179 changes: 163 additions & 16 deletions src/model_diagnostics/_utils/array.py
@@ -1,3 +1,5 @@
+import copy
+import sys
 from typing import Optional, Union
 
 import numpy as np
@@ -24,18 +26,6 @@ def length_of_first_dimension(a: npt.ArrayLike) -> int:
         raise ValueError(msg)
 
 
-def validate_same_first_dimension(a: AL_or_polars, b: AL_or_polars) -> bool:
-    """Validate that 2 array-like have the same length of the first dimension."""
-    if length_of_first_dimension(a) != length_of_first_dimension(b):
-        msg = (
-            "The two array-like objects don't have the same length of their first "
-            "dimension."
-        )
-        raise ValueError(msg)
-    else:
-        return True
-
-
 def length_of_second_dimension(a: npt.ArrayLike) -> int:
     """Return length of second dimension."""
     if not hasattr(a, "shape"):
@@ -56,23 +46,35 @@ def get_second_dimension(a: npt.ArrayLike, i: int) -> npt.ArrayLike:
     if hasattr(a, "iloc"):
         # pandas
         return a.iloc[:, i]
-    elif hasattr(a, "column"):
+    elif hasattr(a, "column") and callable(a.column):
         # pyarrow
         return a.column(i)  # a[i] would also work
     elif isinstance(a, (list, tuple)):
-        return np.asarray(a)[:, i]
+        return np.array([row[i] for row in a])
     else:
         # numpy or polars
         return a[:, i]  # type: ignore
 
 
+def validate_same_first_dimension(a: AL_or_polars, b: AL_or_polars) -> bool:
+    """Validate that 2 array-like have the same length of the first dimension."""
+    if length_of_first_dimension(a) != length_of_first_dimension(b):
+        msg = (
+            "The two array-like objects don't have the same length of their first "
+            "dimension."
+        )
+        raise ValueError(msg)
+    else:
+        return True
+
+
 def validate_2_arrays(
     a: npt.ArrayLike, b: npt.ArrayLike
 ) -> tuple[np.ndarray, np.ndarray]:
     """Validate 2 arrays.
 
-    Both arrays are checked to have same dimensions and shapes and returned as numpy
-    arrays.
+    Both arrays are checked to have same dimensions and shapes.
+    They are returned as numpy arrays.
 
     Returns
     -------
@@ -163,3 +165,148 @@ def get_sorted_array_names(y_pred: Union[npt.ArrayLike, pl.Series, pl.DataFrame]
         sorted_indices = [0]
 
     return pred_names, sorted_indices
+
+
+def is_pandas_df(x):
+    """Return True if x is a pandas DataFrame."""
+    try:
+        pd = sys.modules["pandas"]
+    except KeyError:
+        return False
+    return isinstance(x, pd.DataFrame)
+
+
+def is_pyarrow_array(x):
+    """Return True if x is a pyarrow Array or ChunkedArray."""
+    try:
+        pa = sys.modules["pyarrow"]
+    except KeyError:
+        return False
+    return isinstance(x, (pa.Array, pa.ChunkedArray))
+
+
+def is_pyarrow_table(x):
+    """Return True if x is a pyarrow Table or RecordBatch."""
+    try:
+        pa = sys.modules["pyarrow"]
+    except KeyError:
+        return False
+    return isinstance(x, (pa.Table, pa.RecordBatch))
+
+
+def safe_assign_column(x, values, column_index):
+    """Safely assign a values array to a column of an array-like x.
+
+    Parameters
+    ----------
+    x : array-like
+        Array to be modified. It is expected to be 2-dimensional.
+    values : ndarray
+        The values to be assigned to `x`.
+    column_index : int
+        Index of the column / second dimension.
+
+    Returns
+    -------
+    x : Modified `x` with the newly assigned column.
+    """
+    if isinstance(x, list):
+        # Multiple rows may point to the same underlying object, e.g. as a result
+        # of repeated indices like safe_index_rows(x, [0, 0, 1, 1]). Therefore, we
+        # must be careful, i.e. (shallow) copy the rows.
+        if hasattr(x[0], "copy"):
+
+            def copy_element(x):
+                return x.copy()
+
+        elif hasattr(x[0], "clone"):
+
+            def copy_element(x):
+                return x.clone()
+
+        else:
+
+            def copy_element(x):
+                return copy.copy(x)
+
+        try:
+            row = copy_element(x[0])
+            row[column_index] = values[0]
+        except Exception as e:
+            e.add_note("Unable to set item in safe_assign_column of a list object.")
+            raise
+        if row[column_index] != values[0]:
+            msg = "Elements of the list can't be assigned new values."
+            raise ValueError(msg)
+
+        for i in range(len(x)):
+            row = copy_element(x[i])
+            row[column_index] = values[i]
+            x[i] = row
+    elif is_pandas_df(x):
+        # Possible fix for older versions
+        # if isinstance(values, pl.Series):
+        #     pd = sys.modules["pandas"]
+        #     values = pd.api.interchange.from_dataframe(
+        #         pl.DataFrame(values)).iloc[:, 0]
+        try:
+            # Avoid deprecation warning of pandas by handling dtype explicitly.
+            # Setting an item of incompatible dtype is deprecated and will raise
+            # an error in a future version of pandas.
+            pd = sys.modules["pandas"]
+            dtype = x.dtypes.iloc[column_index]
+            x.iloc[:, column_index] = pd.Series(
+                data=(values.to_pandas() if isinstance(values, pl.Series) else values),
+                dtype=dtype,
+            )
+        except Exception as e:
+            # FIXME: pyarrow version XXX
+            # The AttributeError of older pyarrow versions does not have an
+            # 'add_note' method.
+            args = e.args
+            msg = (
+                args[0]
+                + "\nThe problem might be fixable with newer versions of pandas, polars"
+                " or pyarrow."
+            )
+            raise type(e)(msg, *args[1:]) from e
+    elif is_pyarrow_table(x):
+        x = x.set_column(column_index, x.column_names[column_index], [values])
+    elif isinstance(x, pl.DataFrame):
+        cname = x.columns[column_index]
+        dtype = x.get_column(cname).dtype
+        x = x.with_columns(pl.Series(values, dtype=dtype).alias(cname))
+    else:  # numpy array or other array-like
+        x[:, column_index] = values
+    return x
+
+
+def safe_index_rows(x, indices):
+    """Safely index rows (first dimension) of an array-like x.
+
+    Parameters
+    ----------
+    x : array-like
+        Array-like to be indexed on its first dimension.
+    indices : array-like of integers
+        The row indices to select.
+
+    Returns
+    -------
+    subset
+        Subset of x on first dimension. This may be a view.
+    """
+    index = np.asarray(indices)
+    if index.dtype.kind not in ("i", "u"):
+        msg = "Only integer indices are allowed for indexing rows."
+        raise ValueError(msg)
+
+    if is_pyarrow_table(x) or is_pyarrow_array(x):
+        return x.take(indices)
+    elif hasattr(x, "iloc"):
+        # Using take() instead of iloc[] ensures the return value is a "proper"
+        # copy that will not raise SettingWithCopyWarning.
+        return x.take(indices, axis=0)
+    elif isinstance(x, (list, tuple)):
+        return [x[idx] for idx in indices]
+    else:
+        # numpy, polars
+        return x[indices]
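
To make the intent of these private helpers concrete, here is a small usage sketch. It is illustrative only, not part of the commit: the data are synthetic, and the helpers live in the private module `model_diagnostics._utils.array`, so their interface may change.

import numpy as np
import polars as pl
from model_diagnostics._utils.array import (
    get_second_dimension,
    is_pandas_df,
    safe_assign_column,
    safe_index_rows,
)

df = pl.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# The is_* helpers look the library up in sys.modules instead of importing it,
# so pandas and pyarrow remain optional dependencies.
assert not is_pandas_df(df)  # df is polars, not pandas

# Row indexing, possibly with repeated indices:
rows = safe_index_rows(df, np.array([0, 0, 2]))

# Column assignment preserves the container type and dtype; presumably this is
# how a feature column gets overwritten with grid values for partial dependence.
rows = safe_assign_column(rows, np.array([30.0, 30.0, 30.0]), 1)

# get_second_dimension works across pandas, polars, pyarrow, numpy and lists;
# for lists it extracts element i per row instead of casting via np.asarray:
col = get_second_dimension([[1, "x"], [2, "y"]], 1)  # -> array(['x', 'y'], ...)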