Behavior of OneHotEncoder handle_unknown option #92


Closed
multiloc opened this issue Jul 31, 2018 · 17 comments

@multiloc

I'm trying to understand the behavior (and intent) of the handle_unknown option for OneHotEncoder (and by extension OrdinalEncoder). The docs imply that this should control NaN handling, but the examples below seem to indicate otherwise (category_encoders==1.2.8).

In [2]: import pandas as pd
   ...: import numpy as np
   ...: from category_encoders import OneHotEncoder
   ...: 

In [3]: X = pd.DataFrame({'a': ['foo', 'bar', 'bar'],
   ...:                   'b': ['qux', np.nan, 'foo']})
   ...: X
   ...: 
Out[3]: 
     a    b
0  foo  qux
1  bar  NaN
2  bar  foo

In [4]: encoder = OneHotEncoder(cols=['a', 'b'], handle_unknown='ignore', 
   ...:                         impute_missing=True, use_cat_names=True)
   ...: encoder.fit_transform(X)
   ...: 
Out[4]: 
   a_foo  a_bar  b_qux  b_nan  b_foo
0      1      0      1      0      0
1      0      1      0      1      0
2      0      1      0      0      1

In [5]: encoder = OneHotEncoder(cols=['a', 'b'], handle_unknown='impute', 
   ...:                         impute_missing=True, use_cat_names=True)
   ...: encoder.fit_transform(X)
   ...: 
Out[5]: 
   a_foo  a_bar  a_-1  b_qux  b_nan  b_foo  b_-1
0      1      0     0      1      0      0     0
1      0      1     0      0      1      0     0
2      0      1     0      0      0      1     0

In [6]: encoder = OneHotEncoder(cols=['a', 'b'], handle_unknown='error', 
   ...:                         impute_missing=True, use_cat_names=True)
   ...: encoder.fit_transform(X)
   ...: 
Out[6]: 
   a_foo  a_bar  b_qux  b_nan  b_foo
0      1      0      1      0      0
1      0      1      0      1      0
2      0      1      0      0      1

In [7]: encoder = OneHotEncoder(cols=['a', 'b'], handle_unknown='ignore', 
   ...:                         impute_missing=False, use_cat_names=True)
   ...: encoder.fit_transform(X)
   ...: 
Out[7]: 
   a_foo  a_bar  b_qux  b_nan  b_foo
0      1      0      1      0      0
1      0      1      0      1      0
2      0      1      0      0      1

In particular, 'error' and 'ignore' give the same behavior, treating missing observations as another category. 'impute' adds constant zero-valued columns but also treats missing observations as another category. Naively, I would've expected behavior similar to pd.get_dummies(X, dummy_na={True|False}), with handle_unknown='ignore' corresponding to dummy_na=False.
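For comparison, a minimal sketch of the pd.get_dummies behaviour referenced above, using the same frame:

```python
import numpy as np
import pandas as pd

X = pd.DataFrame({'a': ['foo', 'bar', 'bar'],
                  'b': ['qux', np.nan, 'foo']})

# dummy_na=False drops the missing value entirely: the NaN row gets
# all zeros across the 'b' indicator columns
print(pd.get_dummies(X, dummy_na=False))

# dummy_na=True adds an explicit b_nan column for the missing value
print(pd.get_dummies(X, dummy_na=True))
```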

@wdm0006
Collaborator

wdm0006 commented Aug 4, 2018

handle_unknown is for handling categories that were not present in the data at the time of fit. So if you train an encoder on a categorical variable with categories A, B, and C, handle_unknown specifies what to do with a newly observed category D if it shows up in transform.

NaN is just another category. So if it's passed in as a NaN during fit, then it's not unknown.
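In other words (a pandas-only sketch of the distinction, not the library's implementation): an encoder that learns its categories at fit time treats only values outside that learned set as unknown, and a NaN seen during fit is simply one of the learned categories.

```python
import numpy as np
import pandas as pd

fit_values = pd.Series(['A', 'B', np.nan])   # NaN is present at fit time
new_values = pd.Series(['A', 'D', np.nan])   # 'D' was never seen at fit time

# "Fit": record the non-NaN categories and whether a NaN was observed
learned = set(fit_values.dropna().unique())
saw_nan = fit_values.isna().any()

def status(v):
    # Missing values are their own case; only non-NaN values outside the
    # learned set count as "unknown"
    if pd.isna(v):
        return 'missing (seen at fit)' if saw_nan else 'missing (unseen)'
    return 'known' if v in learned else 'unknown'

print([status(v) for v in new_values])
# ['known', 'unknown', 'missing (seen at fit)']
```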

@mark-mediware

I'm having similar issues, although with categories not present in the fit data set:

import pandas as pd
import category_encoders as ce

raw_data = {'name': ['Dwigt', 'Jim', 'Angela', 'Phyllis', 'Michael_Scott'],
            'nationality': ['USA', 'USA', 'France', 'UK', 'UK'],
            'age': [42, 52, 36, 24, 70]}
train = pd.DataFrame(raw_data, columns=['name', 'nationality', 'age'])

            name nationality  age
0          Dwigt         USA   42
1            Jim         USA   52
2         Angela      France   36
3        Phyllis          UK   24
4  Michael_Scott          UK   70


test_data = {'name': ['Pam', 'Roy', 'Stanley', 'Ryan_Howard', 'Bob_Vance'],
             'nationality': ['USA', 'Canada', 'France', 'Canada', 'UK'],
             'age': [42, 52, 36, 24, 70]}
test = pd.DataFrame(test_data, columns=['name', 'nationality', 'age'])

print(test)
          name nationality  age
0          Pam         USA   42
1          Roy      Canada   52
2      Stanley      France   36
3  Ryan_Howard      Canada   24
4    Bob_Vance          UK   70

one_hot = ce.OneHotEncoder(drop_invariant=False, impute_missing=True,
                           use_cat_names=True, return_df=True,
                           cols=['nationality'], handle_unknown='impute')

enc = one_hot.fit(train)
train_enc_one = enc.transform(train)
test_enc_one = enc.transform(test)

print(test_enc_one)
   nationality_USA  nationality_France  nationality_UK  nationality_-1  \
0                1                   0               0               0   
1                0                   0               0               0   
2                0                   1               0               0   
3                0                   0               0               0   
4                0                   0               1               0   

          name  age  
0          Pam   42  
1          Roy   52  
2      Stanley   36  
3  Ryan_Howard   24  
4    Bob_Vance   70  

User error, I'm sure, but I'm not exactly understanding how to populate the example -1 column with truly new categorical values.
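The behaviour expected above, with unseen categories landing in the -1 column, can be sketched in plain pandas (an illustrative sketch, not the library's code):

```python
import pandas as pd

train_vals = pd.Series(['USA', 'USA', 'France', 'UK', 'UK'])
test_vals = pd.Series(['USA', 'Canada', 'France', 'Canada', 'UK'])

# "Fit": learn the category set from the training data
known = list(train_vals.unique())

# "Transform": map anything unseen to the sentinel '-1', then one-hot encode
mapped = test_vals.where(test_vals.isin(known), '-1')
encoded = pd.get_dummies(mapped, prefix='nationality')
print(encoded)
# The two 'Canada' rows now fall into the nationality_-1 column
```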

@janmotl
Collaborator

janmotl commented Oct 24, 2018

@JohnnyC08 I think we should handle new categories in get_dummies. A sketch of the test:

    def test_one_hot_unknown(self):
        X = pd.DataFrame({'col': ['A', 'B', 'C']})
        X_t = pd.DataFrame({'col': ['A', 'B', 'C', 'D']})

        encoder = encoders.OneHotEncoder(impute_missing=True, handle_unknown='impute')
        result = encoder.fit(X).transform(X_t)
        self.assertEqual(1, result['col_-1'].sum())

        encoder = encoders.OneHotEncoder(impute_missing=True, handle_unknown='error')
        encoder.fit(X)
        self.assertRaises(ValueError, encoder.transform, X_t)

        encoder = encoders.OneHotEncoder(impute_missing=True, handle_unknown='ignore')
        result = encoder.fit(X).transform(X_t)
        self.assertTrue(result.isna().any().any())

Just note that it is merely a proposal. Also note that OrdinalEncoder currently returns 0 when set to impute.

@janmotl janmotl added the bug label Oct 24, 2018
@JohnnyC08
Contributor

@janmotl Is it worth continuing to maintain impute_missing? The code itself says the field will eventually be deprecated. What are your thoughts on removing it and using just the handle_unknown field?

@janmotl
Collaborator

janmotl commented Oct 24, 2018

@JohnnyC08, @wdm0006, @jkleint We have to come up with a plan for how to deal with:

  1. New categorical values in the scoring data.
  2. Missing values (like NaNs) in the data.

My proposal:

  1. Handle treatment of new categorical values in the scoring data exclusively with the handle_unknown argument. The default value would be 'impute', as it is now. Some encoders, like HashingEncoder, are not going to have this argument because they can naturally handle new categories.
  2. Missing values will be handled automatically by the encoders in such a way that no missing value, be it in the training or scoring data, appears at the output of the encoder. Reasoning: if a downstream model cannot handle a mixture of numerical and categorical features, it is likely that the model cannot deal with missing values either. And encoders can frequently deal with missing values better than simple mode/mean imputation, which would commonly be used.

If it turns out that we need a setting for missing value treatment, it will be encoder-specific and possibly controlled with an argument like handle_missing, in parallel to handle_unknown.

Implementation of the proposal would lead to removing the impute_missing argument. And we would have to introduce indicator columns into BackwardDifferenceEncoder, HelmertEncoder, PolynomialEncoder and SumEncoder, which would indicate the presence of NaNs in the features.
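The proposed NaN indicator column could look like this (an illustrative sketch, not the eventual implementation):

```python
import numpy as np
import pandas as pd

X = pd.DataFrame({'col': ['A', np.nan, 'B']})

# Add a 0/1 indicator column marking rows where the original value was missing;
# the contrast encoding of 'col' itself would sit alongside it
X_enc = X.copy()
X_enc['col_nan'] = X['col'].isna().astype(int)
print(X_enc)
```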

@jkleint

jkleint commented Oct 24, 2018

In practice, does that imply that e.g. OrdinalEncoder would add two extra values to those present in the data, one for NaNs and one for new categorical values?

@janmotl
Collaborator

janmotl commented Oct 24, 2018

Yes.

TargetEncoder is in both scenarios going to return the global target average.
WOEEncoder is in both scenarios going to return 0 (this represents no information in WOE).
OneHotEncoder may populate the -1 column with 1 when a NaN is observed. And new categories may generate an all-zero row (that is the current behaviour).

Alternative options:

  1. Keep handling of NaNs as it is and just document it.
  2. Always propagate NaNs to the output.

The first alternative option makes swapping one encoder for another a potentially risky operation. The second alternative option keeps things modular, which is nice. And it simplifies the implementation of the encoders.

@JohnnyC08
Contributor

Ok, I like the proposal you list here #92 (comment)

Let's talk about what happens for each value of handle_unknown:

  1. impute
    i. An indicator column will be added corresponding to unseen values in the test set, NaN or otherwise.

  2. ignore
    i. If there are NaNs in the training set, then the transformation is constructed without taking those into account.

  3. error
    i. Throw a ValueError if there are missing values in the data set.

So, for the ordinal encoder, a value for NaN would only be added if there is a NaN in the training set.
To start on this, do you think it's worth taking care of #144 first?

@janmotl
Collaborator

janmotl commented Oct 26, 2018

I came to the conclusion that we have to decouple the treatment of missing values and unknown values. If they are independent, it will be easier to implement new treatments. And it will be easier to port them from one encoder to another.

What does it mean? Each encoder will take 2 arguments: handle_unknown and handle_missing. Even HashingEncoder will have handle_unknown; it's just that it is going to have a default value for handle_unknown and no other option. Reasoning: each encoder will have a place in the documentation where it describes how it treats missing/unknown values.

Some possible values for handle_unknown (note that not all of them have to be implemented for each encoder - they are here just to illustrate the range of possibilities that the change allows):

  1. error: Raises a ValueError when a new value (unobserved during training) is encountered during the encoding.
  2. ignore: Returns NaN.
  3. value: Returns some value. TargetEncoder returns the global target average. OrdinalEncoder returns -1. OneHotEncoder returns a row full of 0. HashingEncoder returns a hash of the value.
  4. rare_group: Rare values in the training data can be grouped into a single group. New values encountered during scoring will be assigned to this group. The nice thing is that the downstream model will encounter the encoded group already during the training phase, not only during the scoring.

Some possible values for handle_missing:

  1. error: Raises a ValueError when a missing value is encountered (be it during training or transforming). Possibly useful only during encoder development.
  2. ignore: Propagates the missing value to the output. The exact type of the missing value (NaN/None/NaT/...) is not preserved, simply because NaT can appear only in temporal attributes, but the encoded output is numerical.
  3. value: OrdinalEncoder returns -2. HashingEncoder returns -2.
  4. indicator: Creates an indicator column that takes value 1 when the input is missing, otherwise 0. Essentially, the missing value is treated as another category. The only difference is that the indicator column is created even when no missing value is encountered in the training data. Note that drop_invariant removes indicator columns that are invariant during training.
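A sketch of the proposed sentinel scheme for an OrdinalEncoder-like mapping (sentinel values assumed per the proposal: unknown maps to -1, missing maps to -2):

```python
import numpy as np
import pandas as pd

fit_col = pd.Series(['A', 'B', 'A'])
score_col = pd.Series(['A', 'D', np.nan])

# Fit: assign ordinals to the categories seen in training
mapping = {cat: i + 1 for i, cat in enumerate(pd.unique(fit_col))}

# Transform: missing -> -2, unknown -> -1, known -> its learned ordinal
def encode(v):
    if pd.isna(v):
        return -2
    return mapping.get(v, -1)

print([encode(v) for v in score_col])
# [1, -1, -2] under these assumed sentinels
```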

Justification of the special values:

  1. Currently we use only -1. Since negative values are traditionally used as sentinel values, it makes sense to continue with -2, then -3, and so on.
  2. When handle_missing=value and handle_unknown=value is set, missing values and unknown values should map to two different values.

Justification of having so many options:

  1. Quite possibly the selection of the best (unknown/missing) treatment depends on the data (as in the no-free-lunch theorem).
  2. Even if there were a best treatment for each encoder, I do not know such treatments.
  3. Even if I knew such treatments, people would like to see empirical results from benchmarks.

Possible alterations:

  1. I opted to use the value keyword in place of impute because value is a superset of impute. But the proposed behaviour corresponds to constant/mean imputation (with the exception of HashingEncoder for unknown values). Hence, I leave it up to the implementor to pick the keyword(s).
  2. The ignore keyword can be replaced with something more specific, like return_nan.

@jkleint, @JohnnyC08 What are your opinions about the proposal?
@JohnnyC08 Go ahead with OrdinalEncoder.

@JohnnyC08
Contributor

@janmotl Here are my thoughts

So, the handle_unknown settings will affect transform-time behavior only.

  1. error - Makes sense and should be easy to implement.
  2. ignore - I do agree that return_nan is better.
  3. value - Makes sense; let the encoder decide what to do.
  4. rare_group - I think this should only work if there are settings set during training to determine what is and is not a rare group, and we should raise a ValueError if handle_unknown is set to this and there are no rare-group training settings. We can possibly circle around and implement it in a second pass.

Here are my thoughts for handle_missing:

  1. error - Throw an error during train/test; should be easy to implement.
  2. ignore - I get propagating NaN to the output when we are transforming a single column, e.g. TargetEncoder, but what about encoders that output multiple columns, e.g. OneHotEncoder or HelmertEncoder? Should we set a row of NaNs?
  3. value - I think encoders that transform a single column should allow value. I don't think it makes sense for encoders that output multiple columns.
  4. indicator - Similar to the above, this should only be an option for encoders that output multiple columns.

When we start implementing, we may want to tackle the OrdinalEncoder first because so many encoders use it. Then, we can circle around to OneHotEncoder to tackle this issue.

What do you think of my interpretations?

@janmotl
Collaborator

janmotl commented Oct 27, 2018

@JohnnyC08 I agree.

  1. rare_group should be implemented later on. But if a column is not invariant, you can always define some category as "rare". But if you would like to have some threshold on rareness, let's say 5%, and no category satisfies this threshold, you may just define a new category value (as OrdinalEncoder does with handle_unknown='value'). Justification: if you are encoding 1000 columns at once, the probability that at least one of them does not contain a rare category approaches certainty. Returning an error in such a case would limit the usability of the treatment.
  2. handle_missing=ignore - I agree. We should set the whole row to NaN.
  3. handle_missing=value - I agree.
  4. handle_missing=indicator - I agree.

@JohnnyC08
Contributor

Ok, I'm gonna take a pass at the OrdinalEncoder first because many of the encoders rely on it.

@JohnnyC08
Contributor

JohnnyC08 commented Oct 29, 2018

@janmotl I've rethought it and I have revised what I think.

So, handle_unknown should have

  1. error - Raise a ValueError when a new value is encountered during transform.
  2. return_nan - Return NaN, either as a single value, for encoders that transform the column, or as a row of NaNs, for encoders that encode multiple columns.
  3. value - Let the encoder decide how to handle it, so encoders like OrdinalEncoder can use a sentinel value, say -2, and encoders such as HelmertEncoder can return a row of all zeroes.
  4. indicator - Only supported by encoders that output multiple columns. Now we can have indicator columns for unknowns and indicator columns for missing.

I have been implementing handle_missing and I have several conflicting thoughts.

  1. value would just lead the encoder to treat missing values as a category in fit; and if a missing value is a new value in transform, then shouldn't we just let handle_unknown determine that?

  2. return_nan - Are we thinking of returning NaN at both fit and transform time?

  3. indicator - It's similar to value: if NaN is part of the training set, then do we really need to add an indicator column?

I think part of my confusion is that for handle_unknown everything happens at transform time except indicator, which would need to be set during fit time, while most of handle_missing needs changes during fit time.

@janmotl
Collaborator

janmotl commented Oct 29, 2018

To the conflicting thoughts:

  1. value - This is a dangerous proposal. What if handle_unknown=error? Then, when we encounter a NaN during fit time, we will return a non-NaN value, but an error during transform time? That is strange.

  2. return_nan - Yes. This way we give the user an opportunity to handle missing values any way they want to.

  3. indicator - We do not want to duplicate the same column. Either encode NaN into an indicator column, or as a regular value into its own column, but not both. Just be careful about the naming convention. I would expect that, no matter whether the training set contains a NaN, I always get the same column names (assuming the count of non-NaN categories does not change). This property may simplify encoder usage in cross-validation.
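The column-name stability described above, identical output columns whether or not the training fold happens to contain a NaN, can be sketched with pandas categoricals (illustrative only; the helper name is hypothetical):

```python
import numpy as np
import pandas as pd

def one_hot_with_indicator(s, categories):
    # Fix the category set up front so the output columns never vary
    c = pd.Categorical(s, categories=categories)
    out = pd.get_dummies(c, prefix='col')
    # The indicator column is always present, even if this fold has no NaNs
    out['col_nan'] = s.isna().astype(int)
    return out

cats = ['A', 'B']
fold_with_nan = pd.Series(['A', np.nan, 'B'])
fold_without_nan = pd.Series(['A', 'B', 'B'])

# Same columns either way: col_A, col_B, col_nan
print(one_hot_with_indicator(fold_with_nan, cats).columns.tolist())
print(one_hot_with_indicator(fold_without_nan, cats).columns.tolist())
```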

Additional proposal:

  1. I propose to rename indicator to indicator_column to make sure that no one interprets it as "indicator_value".

@xthomas8888

xthomas8888 commented Nov 1, 2018

Newbie to encoding but actively watching this issue due to work.

In my opinion, there can be two ways of encoding NaN and unknown/unseen labels. The first is to allocate NaN and unseen labels all into the '-1' category during encoding; the second is to encode NaN as a category different from '-1' (get_dummies(dummy_na=True)).

When I was using this package, similar to @mark-mediware's case, there were unseen levels in my test data, but they were not encoded under the '-1' category, where I expected to see them. So, I just want to understand how I can properly encode test data with both missing values and unseen values when I implement machine learning. BTW, what is the timeline for a new version of the category_encoders package? Thanks!

@janmotl
Collaborator

janmotl commented Nov 1, 2018

@xthomas8888 There are many ways to deal with missing and unknown/unseen values in machine learning (btw, I wouldn't be against renaming handle_unknown to handle_unseen). This is possibly because each downstream model prefers slightly different treatments.

For example, a reasonably good treatment of missing values for regression models is to impute the missing value with the column average and introduce an indicator column. If the value is missing completely at random, imputation with the average is sufficient and the indicator column is going to have a weight insignificantly different from zero. On the other hand, if the missingness is predictive, the presence of the indicator column is going to improve the model's accuracy.
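The treatment described above, mean imputation plus an indicator column, in a minimal numeric sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1.0, np.nan, 3.0, np.nan]})

# Indicator: 1 where the value was missing, so a regression model can
# learn a coefficient for "was missing"
df['x_missing'] = df['x'].isna().astype(int)

# Impute missing values with the column mean (here 2.0)
df['x'] = df['x'].fillna(df['x'].mean())
print(df)
```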

If the unseen values are unseen because they are rare, it makes sense to assign them to the "rare group" as discussed above. But if the values are unseen because the distribution of the data is changing over time, you had better use online learning and models that can deal with concept drift.

@janmotl
Collaborator

janmotl commented Jan 20, 2019

Solved by @JohnnyC08
