Dimension errors when using sklearn OneHotEncoder with min_frequency parameter #545

Open
dclaz opened this issue Nov 4, 2022 · 1 comment

dclaz commented Nov 4, 2022

The documentation suggests that sklearn's `OneHotEncoder` should be a viable transformation when using the `MimicExplainer`, but I get errors if I use it with the `min_frequency` parameter set to drop category levels with low counts.

If I set up my data preprocessor like this (where I have ~7 categorical features, each with many levels):

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Define categorical transformer
categorical_transformer = Pipeline(
    steps=[
        ("cat_impute", SimpleImputer(strategy="constant", fill_value="missing")),
        ("onehot", OneHotEncoder(drop=None, handle_unknown="infrequent_if_exist", sparse=False, min_frequency=0.01)),
    ]
)
# Define numeric transformer
numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]
)

data_preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features)       
    ],
    remainder="drop",
)

I get the following error:

[screenshot of the dimension-error traceback]

However, if I use a separate transformer for each categorical feature, the explainer runs, albeit with a `Many to one/many maps found in input` warning, and produces outputs that don't really make sense (half the features end up with very, very similar SHAP values).


# Define categorical transformer
categorical_transformer = Pipeline(
    steps=[
        ("cat_impute", SimpleImputer(strategy="constant", fill_value='missing')),
        ("onehot", OneHotEncoder(drop=None, handle_unknown="infrequent_if_exist", sparse=False, min_frequency=0.01)),
    ]
)
# Define numeric transformer
numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]
)

# Construct list of categorical transformers 
categorical_treatments_list = [(feature, categorical_transformer, [feature]) for feature in categorical_features]

# Construct the data preprocessor
data_preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        *categorical_treatments_list
    ],
    remainder="drop",
)
paulbkoch commented
Hi @dclaz -- This appears to be a question for the interpret-community repo. Transferring your issue there.

@paulbkoch paulbkoch transferred this issue from interpretml/interpret Nov 24, 2022