Behavior of OneHotEncoder handle_unknown option #92
Handle unknown is for handling categories that were not present in the data at fit time. So if you train an encoder on a categorical variable with categories A, B, and C, handle_unknown specifies what to do with a newly observed category D if it shows up in transform. NaN is just another category: if it's passed in as a NaN during fit, then it's not unknown. |
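To illustrate the distinction, here is a minimal pandas-only sketch (not the category_encoders implementation) of how a category unseen at fit time differs from one that was present:

```python
import pandas as pd

# Levels "learned" at fit time
fit_levels = ['A', 'B', 'C']

# 'D' was never seen during fit, so it has no code under the fitted levels
new = pd.Series(['A', 'D'])
codes = pd.Categorical(new, categories=fit_levels).codes
print([int(c) for c in codes])  # [0, -1] -> 'A' keeps its code, unseen 'D' gets -1
```

Casting to a `Categorical` with a fixed category list is the simplest way to see which values the "fitted" vocabulary can and cannot represent.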
I'm having similar issues, although with categories not present in the fit data set:

```python
raw_data = {'name': ['Dwigt', 'Jim', 'Angela', 'Phyllis', 'Michael_Scott'],
            'nationality': ['USA', 'USA', 'France', 'UK', 'UK'],
            'age': [42, 52, 36, 24, 70]}
train = pd.DataFrame(raw_data, columns=['name', 'nationality', 'age'])
```

```
            name nationality  age
0          Dwigt         USA   42
1            Jim         USA   52
2         Angela      France   36
3        Phyllis          UK   24
4  Michael_Scott          UK   70
```

```python
test_data = {'name': ['Pam', 'Roy', 'Stanley', 'Ryan_Howard', 'Bob_Vance'],
             'nationality': ['USA', 'Canada', 'France', 'Canada', 'UK'],
             'age': [42, 52, 36, 24, 70]}
test = pd.DataFrame(test_data, columns=['name', 'nationality', 'age'])
print(test)
```

```
          name nationality  age
0          Pam         USA   42
1          Roy      Canada   52
2      Stanley      France   36
3  Ryan_Howard      Canada   24
4    Bob_Vance          UK   70
```

```python
one_hot = ce.OneHotEncoder(drop_invariant=False, impute_missing=True, use_cat_names=True,
                           return_df=True, cols=['nationality'], handle_unknown='impute')
enc = one_hot.fit(train)
train_enc_one = enc.transform(train)
test_enc_one = enc.transform(test)
print(test_enc_one)
```

```
   nationality_USA  nationality_France  nationality_UK  nationality_-1  \
0                1                   0               0               0
1                0                   0               0               0
2                0                   1               0               0
3                0                   0               0               0
4                0                   0               1               0

          name  age
0          Pam   42
1          Roy   52
2      Stanley   36
3  Ryan_Howard   24
4    Bob_Vance   70
```

User error, I'm sure, but I'm not exactly understanding how to populate the example `-1` column with truly new categorical values. |
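As a workaround independent of category_encoders, unseen test categories can be handled with plain pandas by aligning the test dummies to the training columns; rows holding an unseen value then become all zeros:

```python
import pandas as pd

train_nat = pd.Series(['USA', 'USA', 'France', 'UK', 'UK'])
test_nat = pd.Series(['USA', 'Canada', 'France', 'Canada', 'UK'])

train_dummies = pd.get_dummies(train_nat)
# Reindex to the training columns: 'Canada' has no column, so those rows are all zeros
test_dummies = pd.get_dummies(test_nat).reindex(columns=train_dummies.columns, fill_value=0)
print(test_dummies.sum(axis=1).tolist())  # rows 1 and 3 ('Canada') sum to 0
```

This does not add a dedicated `-1` indicator column; it simply drops the unseen level, which matches `handle_unknown='ignore'` semantics rather than `'impute'`.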
@JohnnyC08 I think we should handle new categories as in:

```python
def test_one_hot_unknown(self):
    X = pd.DataFrame({'col': ['A', 'B', 'C']})
    X_t = pd.DataFrame({'col': ['A', 'B', 'C', 'D']})

    encoder = encoders.OneHotEncoder(impute_missing=True, handle_unknown='impute')
    result = encoder.fit(X).transform(X_t)
    self.assertEqual(1, result['col_-1'].sum())

    encoder = encoders.OneHotEncoder(impute_missing=True, handle_unknown='error')
    encoder.fit(X)
    self.assertRaises(ValueError, encoder.transform, X_t)

    encoder = encoders.OneHotEncoder(impute_missing=True, handle_unknown='ignore')
    result = encoder.fit(X).transform(X_t)
    self.assertTrue(result.isnull().any().any())  # DataFrame has isnull(), not isnan()
```

Just note that it is merely a proposal. And that OrdinalEncoder currently returns 0 when set to impute. |
@janmotl Is it worth continuing to maintain |
@JohnnyC08, @wdm0006, @jkleint We have to come up with a plan for how to deal with:
My proposal:
If it turns out that we need a setting for missing-value treatment, it will be encoder specific and possibly controlled with a dedicated argument. Implementation of the proposal would lead to removing |
In practice, does that imply that e.g. OrdinalEncoder would add two extra values to those present in the data, one for NaNs and one for new categorical values? |
Yes. In both scenarios, TargetEncoder is going to return the global target average. Alternative options:
The first alternative option makes swapping one encoder for another a potentially risky operation. The second alternative option keeps things modular, which is nice, and it simplifies the implementation of the encoders. |
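For reference, here is a toy sketch (my own, not the library's code) of a target encoder that falls back to the global target average for both unseen categories and NaN:

```python
import pandas as pd

def target_encode(train_x, train_y, test_x):
    # Per-category target mean learned from the training data
    means = train_y.groupby(train_x).mean()
    global_mean = train_y.mean()
    # Unseen categories and NaN have no learned mean -> fall back to the global average
    return test_x.map(means).fillna(global_mean)

encoded = target_encode(pd.Series(['A', 'A', 'B']),
                        pd.Series([1.0, 0.0, 1.0]),
                        pd.Series(['A', 'C']))
# 'A' -> its training mean 0.5; unseen 'C' -> global mean 2/3
```

Real target encoders add smoothing and leakage protection on top of this, but the fallback logic for unknowns is the part under discussion here.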
OK, I like the proposal you list here: #92 (comment). Let's talk about what happens for each value of
So, for the ordinal encoder, a value for NaN would only be added if there is a NaN in the training set. |
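A minimal sketch of that behavior (hypothetical helper names, not the package API): a NaN code is only learned if NaN occurred during fit, while unseen categories always map to -1:

```python
import pandas as pd

NAN_KEY = '__nan__'  # hypothetical sentinel key for the NaN level

def ordinal_fit(values):
    s = pd.Series(values)
    mapping = {c: i + 1 for i, c in enumerate(s.dropna().unique())}
    if s.isna().any():
        # NaN seen at fit time gets its own code, per the proposal above
        mapping[NAN_KEY] = len(mapping) + 1
    return mapping

def ordinal_transform(values, mapping, unknown_code=-1):
    out = []
    for v in pd.Series(values):
        if pd.isna(v):
            out.append(mapping.get(NAN_KEY, unknown_code))
        elif v in mapping:
            out.append(mapping[v])
        else:
            out.append(unknown_code)  # category unseen during fit
    return out

m = ordinal_fit(['A', 'B', None])              # NaN present at fit -> learned code 3
print(ordinal_transform(['A', 'D', None], m))  # [1, -1, 3]
```

If fit had contained no NaN, a NaN at transform time would fall through to the unknown code instead.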
I came to the conclusion that we have to decouple the treatment of missing values from the treatment of unknown values. If they are independent, it will be easier to implement new treatments, and easier to port them from one encoder to another. What does that mean? Each encoder will take 2 arguments. Some possible values for
Some possible values for
Justification of the special values:
Justification of having so many options:
Possible alterations:
@jkleint, @JohnnyC08 What are your opinions about the proposal? |
@janmotl Here are my thoughts. So,
Here are my thoughts for handle_missing:
When we start implementing, we may want to tackle the OrdinalEncoder first because so many encoders use it. Then, we can circle around to OneHotEncoder to tackle this issue. What do you think of my interpretations? |
@JohnnyC08 I agree. |
OK, I'm gonna take a pass at the OrdinalEncoder first, because many of the encoders rely on it. |
@janmotl I've rethought it and revised my view. So,
I have been implementing
I think part of my confusion is that for |
Two conflicting thoughts:
Additional proposal: |
Newbie to encoding but actively watching this issue due to work. In my opinion, there are two ways of encoding NaN and unknown/unseen labels. The first is to allocate NaN and unseen labels all into the '-1' category during encoding; the second is to encode NaN as a category distinct from '-1' (get_dummies(dummy_na=True)). When I was using this package, similar to @mark-mediware's case, my test data had unseen levels, but they were not encoded under the '-1' category, where I expected to see them. So I just want to understand how to properly encode test data with both missing and unseen values when I implement machine learning. BTW, what is the timeline for a new version of the category_encoders package? Thanks! |
@xthomas8888 There are many ways to deal with missing and unknown/unseen values (btw. I wouldn't be against renaming For example, a reasonably good treatment of missing values for regression models is to impute the missing value with the column average and introduce an indicator column. If the value is missing completely at random, imputation with the average is sufficient, and the indicator column is going to have a weight insignificantly different from zero. On the other hand, if the missingness is predictive, the presence of the indicator column is going to improve the model's accuracy. If the unseen values are unseen because they are rare, it makes sense to assign them to the "rare group" as discussed above. But if the unseen values are unseen because the distribution of the data is changing over time, you had better use online learning and models that can deal with concept drift. |
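The mean-plus-indicator treatment described above can be sketched in a few lines (a generic recipe with a hypothetical helper name, not part of this package):

```python
import numpy as np
import pandas as pd

def impute_with_indicator(s):
    # Indicator column marks where the value was originally missing
    indicator = s.isna().astype(int)
    # Fill the gaps with the column average
    filled = s.fillna(s.mean())
    return pd.DataFrame({s.name: filled, f'{s.name}_missing': indicator})

out = impute_with_indicator(pd.Series([1.0, np.nan, 3.0], name='age'))
print(out['age'].tolist())          # [1.0, 2.0, 3.0]
print(out['age_missing'].tolist())  # [0, 1, 0]
```

A downstream linear model can then learn a separate weight for the indicator, which only matters when the missingness itself is predictive.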
Solved by @JohnnyC08 |
I'm trying to understand the behavior (and intent) of the handle_unknown option for OneHotEncoder (and by extension OrdinalEncoder). The docs imply that this should control NaN handling, but the examples below seem to indicate otherwise (category_encoders==1.2.8).

In particular, 'error' and 'ignore' give the same behavior, treating missing observations as another category. 'impute' adds constant zero-valued columns but also treats missing observations as another category. Naively, I would've expected behavior similar to `pd.get_dummies(X, dummy_na={True|False})`, with `handle_unknown='ignore'` corresponding to `dummy_na=False`.
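For comparison, the `pd.get_dummies` behavior referenced above:

```python
import numpy as np
import pandas as pd

s = pd.Series(['USA', np.nan, 'UK'])

no_na = pd.get_dummies(s)                   # dummy_na=False (default): NaN row is all zeros
with_na = pd.get_dummies(s, dummy_na=True)  # an extra NaN column is added

print(no_na.shape, with_na.shape)  # (3, 2) (3, 3)
print(int(no_na.iloc[1].sum()))    # 0 -> the NaN row belongs to no category
```

With `dummy_na=False` the missing observation simply vanishes from the encoding, which is the behavior the issue expected `handle_unknown='ignore'` to match.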