
Ordinal encoder support new handle unknown handle missing #153


Conversation

@JohnnyC08 (Contributor) commented Nov 2, 2018

Here is the first pass at making Ordinal Encoder support the fields handle_unknown and handle_missing as described in #92.

Let's go through the fields and their logic.

handle_unknown

  1. value
    • Unknown values map to -1 at transform time.
  2. error
    • Raise a ValueError if new categories are encountered at transform time.
  3. return_nan
    • At transform time, return nan for unknown values.

Ok, now handle_missing has a configuration for each setting depending on whether nan is present at fit time (a sketch covering both fields follows the list below).

handle_missing

  1. value
    • nan present at fit time -> nan is treated as a category
    • nan not present at fit time -> transform returns -2
  2. return_nan
    • fit adds a -2 mapping, and at transform time -2 is returned as nan
  3. error
    • At fit or transform time, raise an error.
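
To make that matrix concrete, here is a minimal sketch of the intended behavior. It assumes the `import category_encoders as encoders` alias used in the tests below and that the parameters land exactly as described above:

    import numpy as np
    import pandas as pd
    import category_encoders as encoders

    train = pd.DataFrame({'city': ['chicago', 'st louis']})            # no nan at fit time
    test = pd.DataFrame({'city': ['chicago', 'los angeles', np.nan]})

    enc = encoders.OrdinalEncoder(handle_unknown='value', handle_missing='value')
    enc.fit(train)
    print(enc.transform(test))  # 'los angeles' (unknown) -> -1, nan (missing) -> -2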

Ok, a complete implementation means every encoder will have to be changed. How do we want to avoid gigantic pull requests? Have a long-lived feature branch?

Ok, some thoughts:

  1. I am going to implement Cucumber tests for handle_unknown and handle_missing, because trying to keep it all straight in my head is difficult.
  2. I need to go through inverse transform and check it against every new setting.
  3. My implementation of return_nan makes processing in the downstream encoders more difficult, because we are mapping nan to -2.
  4. The relationship between value and indicator for the multi-column encoders and the output of the ordinal encoder currently confuses me. I am going to sit down and write it all out so I know what should lead to what.
  5. Check the changes to the test_ordinal_dist test in test_ordinal. Why was None not being treated as a category?

Tell me what you think and I can get started on the other encoders.

@janmotl (Collaborator) commented Nov 2, 2018

Regarding test_ordinal_dist, I checked an older version of OrdinalEncoder.

It returns:

    0 -1
    1  1

My guess is that, along the way, we incremented the returned values by one, but we also tried to satisfy the test on the last column. Hence, we ended up with the current situation.

My proposal: Update the test to assert the whole result, not just the last column. And make it insensitive to the exact values:

    def test_ordinal_dist(self):
        data = np.array([
            ['apple', 'lemon'],
            ['peach', None]
        ])
        encoder = encoders.OrdinalEncoder(impute_missing=True)
        result = encoder.fit_transform(data)
        self.assertEqual(2, len(result[0].unique()), "We expect two unique values in the column")
        self.assertEqual(2, len(result[1].unique()), "We expect two unique values in the column")
        self.assertFalse(np.isnan(result.values[1, 1]))

        encoder = encoders.OrdinalEncoder(impute_missing=False)
        result = encoder.fit_transform(data)
        self.assertEqual(2, len(result[0].unique()), "We expect two unique values in the column")
        self.assertEqual(2, len(result[1].unique()), "We expect two unique values in the column")
        self.assertTrue(np.isnan(result.values[1, 1]))

I do not care much whether we count from 1 or 0. But of course, it is better to be consistent.

@janmotl (Collaborator) commented Nov 2, 2018

It looks good. In one test I would maybe be more specific:

    def test_handle_missing_error(self):
        non_null = pd.DataFrame({'city': ['chicago', 'los angeles'], 'color': ['red', np.nan]})  # only the 'city' column is going to be transformed
        has_null = pd.DataFrame({'city': ['chicago', np.nan], 'color': ['red', np.nan]})
        y = pd.Series([1, 0])

        # TODO - implement for all encoders
        for encoder_name in ['OrdinalEncoder']:
            with self.subTest(encoder_name=encoder_name):

                enc = getattr(encoders, encoder_name)(handle_missing='error', cols='city')
                with self.assertRaises(ValueError):
                    enc.fit(has_null, y)

                enc.fit(non_null, y)    # we raise an error only if a missing value is in one of the transformed columns
                with self.assertRaises(ValueError):
                    enc.transform(has_null)

@JohnnyC08 (Contributor, Author)

@janmotl I've been rethinking having settings for value and indicator for encoders that output multiple columns.

The current behavior is for value to add a new column, so I think I only need settings for indicator.

Now, I am not sure I have preserved the ignore functionality, so can you quickly take me through what the expected behavior of ignore is?

@janmotl (Collaborator) commented Nov 4, 2018

value should not add new columns. Only indicator adds new columns.

We came to the conclusion that ignore will be called return_nan. Tests:

    def test_handle_missing_return_nan_train(self):
        X = pd.DataFrame({'city': ['chicago', 'los angeles', None]})
        y = pd.Series([1, 0, 1])

        # TODO - implement for all encoders
        for encoder_name in ['OrdinalEncoder']:
            with self.subTest(encoder_name=encoder_name):
                enc = getattr(encoders, encoder_name)(handle_missing='return_nan')
                result = enc.fit_transform(X, y)
                self.assertTrue(pd.isna(result['city'][2]))

    def test_handle_missing_return_nan_test(self):
        X = pd.DataFrame({'city': ['chicago', 'los angeles', 'chicago']})
        X_t = pd.DataFrame({'city': ['chicago', 'los angeles', None]})
        y = pd.Series([1, 0, 1])

        # TODO - implement for all encoders
        for encoder_name in ['OrdinalEncoder']:
            with self.subTest(encoder_name=encoder_name):
                enc = getattr(encoders, encoder_name)(handle_missing='return_nan')
                result = enc.fit(X, y).transform(X_t)
                self.assertTrue(pd.isna(result['city'][2]))

@JohnnyC08 (Contributor, Author)

Ok yeah, I blanked on that.

I'm making great progress on the Helmert Encoder. For the return_nan case, I'm leaving the intercept at 1. We can change that later if we wish.
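
For illustration, the row shape I have in mind (hypothetical output, not copied from the code):

    # Hypothetical HelmertEncoder output for a return_nan row: the contrast
    # columns become NaN while the intercept stays at 1.
    #    intercept  city_0  city_1
    # 2        1.0     NaN     NaN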

@JohnnyC08 (Contributor, Author)

@janmotl I am one test away from having a working test suite for one of the multi-column encoders, and I am trying to understand why the test's assertion holds.

The test is test_inverse_transform, where a comment states that an exception should not be raised for the BinaryEncoder when a new value is introduced. Why is that the case? If I run the following test

    def test_inverse_transform_new_value(self):
        train = pd.DataFrame({'city': ['Chicago', 'stl']})
        test = pd.DataFrame({'city': ['Chicago', 'la']})

        enc = encoders.BinaryEncoder()
        enc.fit(train)
        result = enc.transform(test)
        test_inv = enc.inverse_transform(result)

        assert test.equals(test_inv)

I see that the second value in the data frame is set to nan. Why is this ok behavior?

@janmotl (Collaborator) commented Nov 8, 2018

I agree that the following part of test_inverse_transform can/should be removed:

                # when a new value is encountered, do not raise an exception
                enc = getattr(encoders, encoder_name)(verbose=1, cols=cols)
                enc.fit(X, y)
                _ = enc.inverse_transform(enc.transform(X_t_extra))

And the behaviour of inverse_transform should possibly depend on handle_unknown. Alternatively, inverse_transform could take an argument specifying whether to perform "best effort reconstruction" or "raise an exception when we cannot perfectly reconstruct the original".

I do not see a reason why BinaryEncoder should behave differently from other encoders. HashingEncoder - sure, this one is different. But BinaryEncoder, not so much.
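
A hypothetical sketch of that second option (the strict name is made up):

    def inverse_transform(self, X, strict=True):
        """Invert the encoding.

        strict=True: raise a ValueError when the original values cannot be
        perfectly reconstructed.
        strict=False: best-effort reconstruction, returning NaN for codes
        that map back ambiguously.
        """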

Commit: … encoder changes for the new handle missing and handle unknown
@JohnnyC08 (Contributor, Author)

@janmotl

You can check the latest commit to see my changes for the encoders that support multiple columns.

The biggest thing I implemented that I am going to seriously rethink is the places where I put

    if handle_missing == 'value':
        del values[np.nan]

I am going to find a way to do that differently.
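
One possible direction, assuming values is a dict keyed by category (hypothetical; the real structure may differ): build the mapping without the nan key in the first place, rather than deleting it after the fact.

    import pandas as pd

    # Filter nan keys out while building the mapping instead of mutating it.
    if handle_missing == 'value':
        values = {k: v for k, v in values.items() if not pd.isna(k)}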

Tell me what you think.

@janmotl (Collaborator) commented Nov 9, 2018

@JohnnyC08 At a glance, it looks good.

This month I am renovating my flat, and it is taxing. Hence, I am not able to provide a deep review. Keep going.

@janmotl (Collaborator) commented Nov 12, 2018

You are right. It looks like Helmert Encoder does not pass:

    def test_handle_missing_return_nan_fit(self):
        X = pd.DataFrame({'city': ['chicago', 'los angeles', None]})
        y = pd.Series([1, 0, 1])

        # TODO - implement for all encoders
        for encoder_name in encoders.__all__:
            with self.subTest(encoder_name=encoder_name):
                enc = getattr(encoders, encoder_name)(handle_missing='return_nan')
                result = enc.fit_transform(X, y)
                self.assertTrue(
                    pd.isna(result.iloc[2]).any())  # Whole row should contain nans. But some encoders may also return an intercept

while Ordinal Encoder passes the test.

@JohnnyC08 (Contributor, Author)

@janmotl Alright, we have the error functionality working for all the encoders.

I still have a lot of work to do to check various edge cases for all the encoders as well as check up on the inverse encoding.

We're getting there 😄

@JohnnyC08 (Contributor, Author)

Ok @janmotl, we're continuing on!

We now have return_nan working for the train and test cases for handle_missing. We also have return_nan working for handle_unknown. The next steps:

  1. handle_unknown set to value.
  2. handle_missing set to value in the train and test cases, both when nan is and is not in the original training set.
  3. Tackle indicator for handle_unknown.
  4. Tackle indicator for handle_missing in the train and test cases, when nan is and is not in the training set.
  5. Work on the inverse transform behaviors.

I introduced vectorization in the one hot encoder so that we build a contrast matrix and then use .loc on it to build the result data frame. So we should see a good speedup from that.
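
Roughly, the vectorization looks like this (a hypothetical sketch with made-up data, not the actual implementation):

    import numpy as np
    import pandas as pd

    codes = pd.Series([1, 2, 1, 3])  # the ordinal-encoded column
    # Build the contrast matrix once, indexed by ordinal code.
    contrast = pd.DataFrame(np.eye(3, dtype=int),
                            index=[1, 2, 3],
                            columns=['city_1', 'city_2', 'city_3'])
    # One vectorized .loc lookup replaces the per-row loop.
    result = contrast.loc[codes].reset_index(drop=True)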

I also introduced the ordinal encoder into the WOE and Target Encoders because it makes dealing with unknown values trivial. Tell me what you think. In the Leave One Out Encoder I have an example of how I could do it without the ordinal encoder, but I have to store two boolean series, which could take a lot of memory if the data frame to transform is big.
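
A simplified sketch of why the ordinal pre-encoding helps (hypothetical names):

    import pandas as pd
    import category_encoders as encoders

    X = pd.DataFrame({'city': ['chicago', 'los angeles']})
    ordinal = encoders.OrdinalEncoder(handle_unknown='value', handle_missing='value')
    X_ordinal = ordinal.fit_transform(X)
    # At transform time, every unknown category arrives as the single
    # sentinel -1, so detecting unknowns is one integer comparison.
    unknown_mask = X_ordinal['city'] == -1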

We're getting there, and as I refactor the encoders I see a lot of room for abstraction, which would make adding new features like this easier. We can look to tackle that some time after this is done. 😄

@janmotl (Collaborator) commented Nov 30, 2018

  1. I am happy to see OHE vectorized.
  2. While ordinal encoding in WOE and Target Encoding adds overhead, it simplifies development and maintenance -> I am ok with that.
  3. The introduction of get_feature_names() caused conflicts.

@JohnnyC08 (Contributor, Author)

Ok, I have all the pieces ready except for the inverse transform piece.

I also had to update the pandas version from 0.20.1 to 0.21.1 to get the fix for pandas-dev/pandas#17006.

@janmotl (Collaborator) commented Dec 22, 2018

Nice work on the code, documentation and tests.

Some typos (some of them are old):

  1. In target encoder:
    ...defaults to 'valie'...
  2. In backward difference, basen, binary encoder,...:
    This can causes unexpected... -> This can cause unexpected...

@JohnnyC08 (Contributor, Author)

@janmotl

I've started to tackle the inverse transform piece, and here are some things I thought about.

It isn't worth considering the case where either field, handle_unknown or handle_missing, is set to error, because we will have already thrown a ValueError during fit or transform. So that leaves us with value, return_nan, and indicator.

The only encoder that transforms to a single column and has an inverse transform function defined is the Ordinal Encoder. So here is what I think the rules should be. Given the available settings return_nan and value:

  1. If both fields are set to value, then calculate the inverse, because we encode unknown as -1 and missing as -2, so the cases are distinguishable.
  2. If both fields are set to return_nan and there are nans, then raise a ValueError, because the requirements for an injective function are violated.
  3. If handle_missing is return_nan and handle_unknown is value, then calculate the inverse, because the two cases remain distinguishable.
  4. If handle_missing is value and handle_unknown is return_nan and there are nans in the input, then raise a ValueError, because we can't map back the unknown categories.

For the encoders that output multiple columns, we have the available settings value, return_nan, and indicator.

  1. If both fields are set to value, then raise a ValueError, because both settings use the same fill values, so we violate the constraints of an injective function.
  2. If handle_unknown is return_nan, handle_missing is some other non-return_nan setting, and there are nans, then raise a ValueError, because we can't map back the unknowns.
  3. If handle_missing is return_nan, handle_unknown is some non-return_nan setting, and there are no unknowns, then calculate the inverse.
  4. If both are return_nan and there are nans, then raise a ValueError.
  5. If handle_unknown is indicator and there are unknowns, then raise a ValueError, because we can't map back from unknowns.
  6. If both are indicator with no unknowns, then calculate the inverse.

Am I missing anything?

@janmotl (Collaborator) commented Dec 23, 2018

There are two moments when we can raise an exception:

  1. During the argument validation.
  2. During the method execution.

I believe we should always try to execute the inverse transformation because the data can actually be without any missing or new value. In other words, with nice data we can always successfully perform the inverse transformation regardless of the setting of handle_unknown and handle_missing.

When, during the method execution, we encounter a value that codes a new value, raise an exception. Hence, with every setting but handle_unknown=error we may raise an exception.

A silly pseudocode illustrating the check in the inverse method for Ordinal Encoder:

    for row in range(len(X_t)):
        if X_t[row, col] == self.handle_unknown_symbol:  # e.g. NaN or -1; the check can be skipped when handle_unknown == 'error'
            raise Exception
    # continue as usual

@JohnnyC08 (Contributor, Author)

Ok how about

    if self.handle_missing == 'value' or self.handle_unknown == 'value':
        if X[col].isin([-1]).any():
            raise ValueError()

    if self.handle_missing == 'return_nan' or self.handle_unknown == 'return_nan':
        if X[col].isnull().any():
            raise ValueError()

    # Execute inverse transform

That would be for encoders that share a fill value, and we know the ordinal encoder will need a separate handle_missing check for -2.

However, I don't see anything around the return_nan behavior.

Tell me what you think.

@janmotl (Collaborator) commented Dec 27, 2018

I think that it is ok when the inverse method in the ordinal encoder returns NaNs. If nothing else, when handle_missing='value' and we encounter -2 in the inverse method, we know that we should replace it with NaN. The only issue is that we do not preserve the type of the missing value (whether it is NaN, None, or something completely different). Hence, two things should be checked:

  1. The documentation of the inverse transform is specific about the way it represents missing values (e.g. that it always returns NaN even if there was None in the original data).
  2. The unit tests optionally ignore differences between different representations of missing values (possibly by altering test_utils.verify_inverse_transform()).

Of course, it is not always possible to know what to return. For example, when both handle_missing = 'return_nan' and handle_unknown = 'return_nan', the bijection is broken. We could perform "the best attempt" and, when we encounter a missing value or an unknown value in the inverse method, return NaN. But a simple error message is ok.

The code for ordinal encoder:

    if self.handle_unknown == 'value':
        if X[col].isin([-1]).any():
            raise ValueError('-1 value was found. But it is impossible to recover a value that has never been observed during the encoder fitting.')

    if self.handle_missing == 'return_nan' and self.handle_unknown == 'return_nan':
        if X[col].isnull().any():
            raise ValueError('NaN value was found. But it is unknown whether it represents a missing value or an unknown value.')

    # Execute inverse transform

The errors could be turned into warnings saying that (potentially) unknown values were turned into NaNs. But I leave the decision up to you.
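
For reference, the warning variant of the first check could look like this (a sketch; the wording is mine):

    import warnings

    if self.handle_unknown == 'value':
        if X[col].isin([-1]).any():
            # Best effort: keep going, but tell the user what the NaNs mean.
            warnings.warn('-1 values were found and will be inverse transformed to NaN: '
                          'the original categories were never observed during fitting.')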

@JohnnyC08 changed the title from "WIP: Ordinal encoder support new handle unknown handle missing" to "Ordinal encoder support new handle unknown handle missing" on Jan 4, 2019
@JohnnyC08 (Contributor, Author)

@janmotl

Alright buddy, I have ensured the inverse transform functions warn when possible and perform best-effort transformations.

That completes everything I can think of that has to do with the new fields. Is there anything else you can think of?

Since there are so many changes, I would feel more comfortable with some more tests. How is #117 coming along? Is it something we can use before merging this in?

Please tell me what you think.

@janmotl (Collaborator) commented Jan 4, 2019

@JohnnyC08 I will merge the code and execute the large benchmark in examples with warnings turned on. If there is no (significant) change in accuracy and there are no warnings, it will be a good sign. Another reason to merge the code is that I wrote a few new encoders, and they depend on your changes. And then there is the new MultiHotEncoder, which should conform to the new convention (the changed tests) from the beginning.

#117 is not really going to help here. The name of #117 should really have been "parameterized tests", and that is done. The issue is left open only because of an unresolved issue with CircleCI, and that's waiting on @wdm0006.

@janmotl merged commit e3ce76f into scikit-learn-contrib:master on Jan 4, 2019
@janmotl (Collaborator) commented Jan 4, 2019

A few things to look at:

  1. OneHotEncoder documentation is missing the handle_missing entry.
  2. There is something strange happening in:
    def test_handle_unknown_in_invariant(self):
        encoder = encoders.BackwardDifferenceEncoder()
        X_test = X_t
        X_test.loc[3, 'invariant'] = 'extra_value'
        encoder.fit(X)

        _ = encoder.transform(X_test)  # Empty mapping DataFrame for invariant feature

And in other encoders, like HelmertEncoder... But it is possibly an old bug. If fixed, we can just alter create_dataset() to conditionally (on extras=True) generate an extra value in the 'invariant' column to test the issue.
  3. The documentation in OrdinalEncoder has inconsistent indentation.

@janmotl (Collaborator) commented Jan 5, 2019

The results of the large benchmark:

fit_runtime (chart): Vectorization was a success. However, get_feature_names() doubled the runtime for HashingEncoder. That's not nice.

score_runtime (chart): Everything is good here.

test_auc (chart): The difference in WOEEncoder is due to abandoning the leave-one-out approach, hence the jump from the accuracy of LeaveOneOutEncoder to the accuracy of TargetEncoder. I am not sure what happened to BaseNEncoder.
