
Ordinal encoder support new handle unknown handle missing #153


Conversation

@JohnnyC08 (Contributor) commented Nov 2, 2018

Here is the first pass at making Ordinal Encoder support the fields handle_unknown and handle_missing as described in #92.

Let's go through the fields and their logic.

handle_unknown

  1. value
    • Unknown values map to -1 at transform time.
  2. error
    • Raise a ValueError if new categories are encountered at transform time.
  3. return_nan
    • At transform time, return nan for unknown values.

Ok, now handle_missing has a configuration for each setting depending on whether nan is present at fit time (a sketch covering both fields follows the list below).

handle_missing

  1. value
    • nan present at fit time -> nan is treated as a category
    • nan not present at fit time -> transform returns -2
  2. return_nan
    • fit adds a -2 mapping, and at transform time -2 is returned as nan
  3. error
    • At fit or transform time, raise an error.
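
To make that matrix concrete, here is a minimal sketch of the intended behavior. It assumes the `import category_encoders as encoders` alias used in the tests below and that the parameters land exactly as described above:

    import numpy as np
    import pandas as pd
    import category_encoders as encoders

    train = pd.DataFrame({'city': ['chicago', 'st louis']})            # no nan at fit time
    test = pd.DataFrame({'city': ['chicago', 'los angeles', np.nan]})

    enc = encoders.OrdinalEncoder(handle_unknown='value', handle_missing='value')
    enc.fit(train)
    print(enc.transform(test))  # 'los angeles' (unknown) -> -1, nan (missing) -> -2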

Ok, a complete implementation means every encoder will have to be changed. How do we want to avoid gigantic pull requests? Have a long-lived feature branch?

Ok, some thoughts:

  1. I am going to implement Cucumber tests for handle_unknown and handle_missing, because trying to keep it all straight in my head is difficult.
  2. I need to go through inverse transform and check it against every new setting.
  3. My implementation of return_nan makes processing in the downstream encoders more difficult, because we are mapping nan to -2.
  4. The relationship between value and indicator for the multi-column encoders and the output of the ordinal encoder currently confuses me. I am going to sit down and write it all out so I know what should lead to what.
  5. Check the changes to the test_ordinal_dist test in test_ordinal. Why was None not being treated as a category?

Tell me what you think and I can get started on the other encoders.

@janmotl (Collaborator) commented Nov 2, 2018

Regarding test_ordinal_dist, I checked an older version of OrdinalEncoder.

It returns:

    0 -1
    1  1

My guess is that, along the way, we incremented the returned values by one, but we also tried to satisfy the test on the last column. Hence, we ended up with the current situation.

My proposal: Update the test to assert the whole result, not just the last column. And make it insensitive to the exact values:

    def test_ordinal_dist(self):
        data = np.array([
            ['apple', 'lemon'],
            ['peach', None]
        ])
        encoder = encoders.OrdinalEncoder(impute_missing=True)
        result = encoder.fit_transform(data)
        self.assertEqual(2, len(result[0].unique()), "We expect two unique values in the column")
        self.assertEqual(2, len(result[1].unique()), "We expect two unique values in the column")
        self.assertFalse(np.isnan(result.values[1, 1]))

        encoder = encoders.OrdinalEncoder(impute_missing=False)
        result = encoder.fit_transform(data)
        self.assertEqual(2, len(result[0].unique()), "We expect two unique values in the column")
        self.assertEqual(2, len(result[1].unique()), "We expect two unique values in the column")
        self.assertTrue(np.isnan(result.values[1, 1]))

I do not care much whether we count from 1 or 0. But of course, it is better to be consistent.

@janmotl (Collaborator) commented Nov 2, 2018

It looks good. In one test I would maybe be more specific:

    def test_handle_missing_error(self):
        non_null = pd.DataFrame({'city': ['chicago', 'los angeles'], 'color': ['red', np.nan]})  # only the 'city' column is going to be transformed
        has_null = pd.DataFrame({'city': ['chicago', np.nan], 'color': ['red', np.nan]})
        y = pd.Series([1, 0])

        # TODO - implement for all encoders
        for encoder_name in ['OrdinalEncoder']:
            with self.subTest(encoder_name=encoder_name):

                enc = getattr(encoders, encoder_name)(handle_missing='error', cols='city')
                with self.assertRaises(ValueError):
                    enc.fit(has_null, y)

                enc.fit(non_null, y)    # we raise an error only if a missing value is in one of the transformed columns
                with self.assertRaises(ValueError):
                    enc.transform(has_null)

@JohnnyC08 (Contributor, Author)

@janmotl I've been rethinking having settings for value and indicator for encoders that output multiple columns.

The current behavior is for value to add a new column, so I think I only need settings for indicator.

Now, I am not sure I have preserved the ignore functionality, so can you quickly take me through what the expected behavior of ignore is?

@janmotl (Collaborator) commented Nov 4, 2018

value should not add new columns. Only indicator adds new columns.

We came to the conclusion that ignore will be called return_nan. Tests:

    def test_handle_missing_return_nan_train(self):
        X = pd.DataFrame({'city': ['chicago', 'los angeles', None]})
        y = pd.Series([1, 0, 1])

        # TODO - implement for all encoders
        for encoder_name in ['OrdinalEncoder']:
            with self.subTest(encoder_name=encoder_name):
                enc = getattr(encoders, encoder_name)(handle_missing='return_nan')
                result = enc.fit_transform(X, y)
                self.assertTrue(pd.isna(result['city'][2]))

    def test_handle_missing_return_nan_test(self):
        X = pd.DataFrame({'city': ['chicago', 'los angeles', 'chicago']})
        X_t = pd.DataFrame({'city': ['chicago', 'los angeles', None]})
        y = pd.Series([1, 0, 1])

        # TODO - implement for all encoders
        for encoder_name in ['OrdinalEncoder']:
            with self.subTest(encoder_name=encoder_name):
                enc = getattr(encoders, encoder_name)(handle_missing='return_nan')
                result = enc.fit(X, y).transform(X_t)
                self.assertTrue(pd.isna(result['city'][2]))

@JohnnyC08 (Contributor, Author)

Ok yeah, I blanked on that.

I'm making great progress on the Helmert Encoder. For the return_nan case, I'm leaving the intercept at 1. We can change that later if we wish.
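
For illustration, the row shape I have in mind (hypothetical output, not copied from the code):

    # Hypothetical HelmertEncoder output for a return_nan row: the contrast
    # columns become NaN while the intercept stays at 1.
    #    intercept  city_0  city_1
    # 2        1.0     NaN     NaN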

@JohnnyC08 (Contributor, Author)

@janmotl I am one test away from having a working test suite for one of the multi-column encoders, and I am trying to understand why the test's assertion holds.

The test is test_inverse_transform, where a comment states that an exception should not be raised for the BinaryEncoder when a new value is introduced. Why is that the case? If I run the following test

    def test_inverse_transform_new_value(self):
        train = pd.DataFrame({'city': ['Chicago', 'stl']})
        test = pd.DataFrame({'city': ['Chicago', 'la']})

        enc = encoders.BinaryEncoder()
        enc.fit(train)
        result = enc.transform(test)
        test_inv = enc.inverse_transform(result)

        assert test.equals(test_inv)

I see that the second value in the data frame is set to nan. Why is this ok behavior?

@janmotl (Collaborator) commented Nov 8, 2018

I agree that the following part of test_inverse_transform can/should be removed:

                # when a new value is encountered, do not raise an exception
                enc = getattr(encoders, encoder_name)(verbose=1, cols=cols)
                enc.fit(X, y)
                _ = enc.inverse_transform(enc.transform(X_t_extra))

And the behaviour of inverse_transform should possibly depend on handle_unknown. Alternatively, inverse_transform could take an argument specifying whether to perform "best effort reconstruction" or "raise an exception when we cannot perfectly reconstruct the original".

I do not see a reason why BinaryEncoder should behave differently from other encoders. HashingEncoder - sure, this one is different. But BinaryEncoder, not so much.
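
A hypothetical sketch of that second option (the strict name is made up):

    def inverse_transform(self, X, strict=True):
        """Invert the encoding.

        strict=True: raise a ValueError when the original values cannot be
        perfectly reconstructed.
        strict=False: best-effort reconstruction, returning NaN for codes
        that map back ambiguously.
        """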

Commit: … encoder changes for the new handle missing and handle unknown
@JohnnyC08 (Contributor, Author)

@janmotl

You can check the latest commit to see my changes for the encoders that support multiple columns.

The biggest thing I implemented that I am going to seriously rethink is the places where I put

    if handle_missing == 'value':
        del values[np.nan]

I am going to find a way to do that differently.
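
One possible direction, assuming values is a dict keyed by category (hypothetical; the real structure may differ): build the mapping without the nan key in the first place, rather than deleting it after the fact.

    import pandas as pd

    # Filter nan keys out while building the mapping instead of mutating it.
    if handle_missing == 'value':
        values = {k: v for k, v in values.items() if not pd.isna(k)}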

Tell me what you think.

@janmotl (Collaborator) commented Nov 9, 2018

@JohnnyC08 At a glance, it looks good.

This month I am renovating my flat, and it is taxing. Hence, I am not able to provide a deep review. Keep going.

@janmotl (Collaborator) commented Nov 12, 2018

You are right. It looks like Helmert Encoder does not pass:

    def test_handle_missing_return_nan_fit(self):
        X = pd.DataFrame({'city': ['chicago', 'los angeles', None]})
        y = pd.Series([1, 0, 1])

        # TODO - implement for all encoders
        for encoder_name in encoders.__all__:
            with self.subTest(encoder_name=encoder_name):
                enc = getattr(encoders, encoder_name)(handle_missing='return_nan')
                result = enc.fit_transform(X, y)
                self.assertTrue(
                    pd.isna(result.iloc[2]).any())  # Whole row should contain nans. But some encoders may also return an intercept

while Ordinal Encoder passes the test.

@JohnnyC08 (Contributor, Author)

@janmotl Alright, we have the error functionality working for all the encoders.

I still have a lot of work to do to check various edge cases for all the encoders as well as check up on the inverse encoding.

We're getting there 😄

@JohnnyC08 (Contributor, Author)

Ok @janmotl, we're continuing on!

We now have return_nan working for the train and test cases for handle_missing. We also have return_nan working for handle_unknown. The next steps:

  1. handle_unknown set to value.
  2. handle_missing set to value in the train and test cases, both when nan is and is not in the original training set.
  3. Tackle indicator for handle_unknown.
  4. Tackle indicator for handle_missing in the train and test cases, when nan is and is not in the training set.
  5. Work on the inverse transform behaviors.

I introduced vectorization in the one hot encoder so that we build a contrast matrix and then use .loc on it to build the result data frame. So we should see a good speedup from that.
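
Roughly, the vectorization looks like this (a hypothetical sketch with made-up data, not the actual implementation):

    import numpy as np
    import pandas as pd

    codes = pd.Series([1, 2, 1, 3])  # the ordinal-encoded column
    # Build the contrast matrix once, indexed by ordinal code.
    contrast = pd.DataFrame(np.eye(3, dtype=int),
                            index=[1, 2, 3],
                            columns=['city_1', 'city_2', 'city_3'])
    # One vectorized .loc lookup replaces the per-row loop.
    result = contrast.loc[codes].reset_index(drop=True)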

I also introduced the ordinal encoder into the WOE and Target Encoders because it makes dealing with unknown values trivial. Tell me what you think. In the Leave One Out Encoder I have an example of how I could do it without the ordinal encoder, but I have to store two boolean series, which could take a lot of memory if the data frame to transform is big.
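
A simplified sketch of why the ordinal pre-encoding helps (hypothetical names):

    import pandas as pd
    import category_encoders as encoders

    X = pd.DataFrame({'city': ['chicago', 'los angeles']})
    ordinal = encoders.OrdinalEncoder(handle_unknown='value', handle_missing='value')
    X_ordinal = ordinal.fit_transform(X)
    # At transform time, every unknown category arrives as the single
    # sentinel -1, so detecting unknowns is one integer comparison.
    unknown_mask = X_ordinal['city'] == -1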

We're getting there, and as I refactor the encoders I see a lot of room for abstraction, which would make adding new features like this easier. We can look to tackle that some time after this is done. 😄

@janmotl (Collaborator) commented Nov 30, 2018

  1. I am happy to see OHE vectorized.
  2. While ordinal encoding in WOE and Target Encoding adds overhead, it simplifies development and maintenance -> I am ok with that.
  3. The introduction of get_feature_names() caused conflicts.

@JohnnyC08 (Contributor, Author)

Ok, I have all the pieces ready except for the inverse transform piece.

I also had to update the pandas version from 0.20.1 to 0.21.1 to get the fix for pandas-dev/pandas#17006.

@janmotl (Collaborator) commented Dec 22, 2018

Nice work on the code, documentation and tests.

Some typos (some of them are old):

  1. In target encoder:
    ...defaults to 'valie'...
  2. In backward difference, basen, binary encoder,...:
    This can causes unexpected... -> This can cause unexpected...

@JohnnyC08 (Contributor, Author)

@janmotl

I've started to tackle the inverse transform piece, and here are some things I thought about.

It isn't worth considering the case where either field, handle_unknown or handle_missing, is set to error, because we will have already thrown a ValueError during fit or transform. So that leaves us with value, return_nan, and indicator.

The only encoder that transforms to a single column and has an inverse transform function defined is the Ordinal Encoder. So here is what I think the rules should be. Given the available settings return_nan and value:

  1. If both fields are set to value, then calculate the inverse, because we encode unknown as -1 and missing as -2, so the cases are distinguishable.
  2. If both fields are set to return_nan and there are nans, then raise a ValueError, because the requirements for an injective function are violated.
  3. If handle_missing is return_nan and handle_unknown is value, then calculate the inverse, because the two cases remain distinguishable.
  4. If handle_missing is value and handle_unknown is return_nan and there are nans in the input, then raise a ValueError, because we can't map back the unknown categories.

For the encoders that output multiple columns, we have the available settings value, return_nan, and indicator.

  1. If both fields are set to value, then raise a ValueError, because both settings use the same fill values, so we violate the constraints of an injective function.
  2. If handle_unknown is return_nan, handle_missing is some other non-return_nan setting, and there are nans, then raise a ValueError, because we can't map back the unknowns.
  3. If handle_missing is return_nan, handle_unknown is some non-return_nan setting, and there are no unknowns, then calculate the inverse.
  4. If both are return_nan and there are nans, then raise a ValueError.
  5. If handle_unknown is indicator and there are unknowns, then raise a ValueError, because we can't map back from unknowns.
  6. If both are indicator with no unknowns, then calculate the inverse.

Am I missing anything?

@janmotl (Collaborator) commented Dec 23, 2018

There are two moments when we can raise an exception:

  1. During the argument validation.
  2. During the method execution.

I believe we should always try to execute the inverse transformation because the data can actually be without any missing or new value. In other words, with nice data we can always successfully perform the inverse transformation regardless of the setting of handle_unknown and handle_missing.

When, during the method execution, we encounter a value that codes a new value, raise an exception. Hence, with every setting but handle_unknown=error we may raise an exception.

A silly pseudocode illustrating the check in the inverse method for Ordinal Encoder:

    for row in range(len(X_t)):
        if X_t[row, col] == self.handle_unknown_symbol:  # e.g. NaN or -1; the check can be skipped when handle_unknown == 'error'
            raise Exception
    # continue as usual

@JohnnyC08 (Contributor, Author)

Ok how about

    if self.handle_missing == 'value' or self.handle_unknown == 'value':
        if X[col].isin([-1]).any():
            raise ValueError()

    if self.handle_missing == 'return_nan' or self.handle_unknown == 'return_nan':
        if X[col].isnull().any():
            raise ValueError()

    # Execute inverse transform

That would be for encoders that share a fill value, and we know the ordinal encoder will need a separate handle_missing check for -2.

However, I don't see anything around the return_nan behavior.

Tell me what you think.

@janmotl (Collaborator) commented Dec 27, 2018

I think that it is ok when the inverse method in the ordinal encoder returns NaNs. If nothing else, when handle_missing='value' and we encounter -2 in the inverse method, we know that we should replace it with NaN. The only issue is that we do not preserve the type of the missing value (whether it is NaN, None, or something completely different). Hence, two things should be checked:

  1. The documentation of the inverse transform is specific about the way it represents missing values (e.g. that it always returns NaN even if there was None in the original data).
  2. The unit tests optionally ignore differences between different representations of missing values (possibly by altering test_utils.verify_inverse_transform()).

Of course, it is not always possible to know what to return. For example, when both handle_missing = 'return_nan' and handle_unknown = 'return_nan', the bijection is broken. We could perform "the best attempt" and, when we encounter a missing value or an unknown value in the inverse method, return NaN. But a simple error message is ok.

The code for ordinal encoder:

    if self.handle_unknown == 'value':
        if X[col].isin([-1]).any():
            raise ValueError('-1 value was found. But it is impossible to recover a value that has never been observed during the encoder fitting.')

    if self.handle_missing == 'return_nan' and self.handle_unknown == 'return_nan':
        if X[col].isnull().any():
            raise ValueError('NaN value was found. But it is unknown whether it represents a missing value or an unknown value.')

    # Execute inverse transform

The errors could be turned into warnings saying that (potentially) unknown values were turned into NaNs. But I leave the decision up to you.
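
For reference, the warning variant of the first check could look like this (a sketch; the wording is mine):

    import warnings

    if self.handle_unknown == 'value':
        if X[col].isin([-1]).any():
            # Best effort: keep going, but tell the user what the NaNs mean.
            warnings.warn('-1 values were found and will be inverse transformed to NaN: '
                          'the original categories were never observed during fitting.')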

@JohnnyC08 changed the title from "WIP: Ordinal encoder support new handle unknown handle missing" to "Ordinal encoder support new handle unknown handle missing" on Jan 4, 2019
@JohnnyC08 (Contributor, Author)

@janmotl

Alright buddy, I have ensured the inverse transform functions warn when possible and perform best-effort transformations.

That completes everything I can think of that has to do with the new fields. Is there anything else you can think of?

Since there are so many changes, I would feel more comfortable with some more tests. How is #117 coming along? Is it something we can use before merging this in?

Please tell me what you think.

@janmotl (Collaborator) commented Jan 4, 2019

@JohnnyC08 I will merge the code and execute the large benchmark in examples with warnings turned on. If there is no (significant) change in accuracy and there are no warnings, it will be a good sign. Another reason to merge the code is that I wrote a few new encoders, and they depend on your changes. And then there is the new MultiHotEncoder, which should conform to the new convention (the changed tests) from the beginning.

#117 is not really going to help here. The name of #117 should really have been "parameterized tests", and that is done. The issue is left open only because of an unresolved issue with CircleCI, and that's waiting on @wdm0006.

@janmotl merged commit e3ce76f into scikit-learn-contrib:master on Jan 4, 2019
@janmotl (Collaborator) commented Jan 4, 2019

A few things to look at:

  1. OneHotEncoder documentation is missing the handle_missing entry.
  2. There is something strange happening in:
    def test_handle_unknown_in_invariant(self):
        encoder = encoders.BackwardDifferenceEncoder()
        X_test = X_t
        X_test.loc[3, 'invariant'] = 'extra_value'
        encoder.fit(X)

        _ = encoder.transform(X_test)  # Empty mapping DataFrame for invariant feature

And in other encoders, like HelmertEncoder... But it is possibly an old bug. If fixed, we can just alter create_dataset() to conditionally (on extras=True) generate an extra value in the 'invariant' column to test the issue.
  3. The documentation in OrdinalEncoder has inconsistent indentation.

@janmotl (Collaborator) commented Jan 5, 2019

The results of the large benchmark:

fit_runtime (chart): Vectorization was a success. However, get_feature_names() doubled the runtime for HashingEncoder. That's not nice.

score_runtime (chart): Everything is good here.

test_auc (chart): The difference in WOEEncoder is due to abandoning the leave-one-out approach, hence the jump from the accuracy of LeaveOneOutEncoder to the accuracy of TargetEncoder. I am not sure what happened to BaseNEncoder.
