-
Notifications
You must be signed in to change notification settings - Fork 400
Ordinal encoder support new handle unknown handle missing #153
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
Ordinal encoder support new handle unknown handle missing #153
Conversation
…nown return nan for transform and inverse transform
…s not a category during fit
Regarding It returns:
My guess is that that along the way, we incremented the returned values by one. But we also tried to satisfy the test on the last column. Hence, we ended up with the current situation. My proposal: Update the test to assert the whole result, not just the last column. And make it insensitive to the exact values: def test_ordinal_dist(self):
data = np.array([
['apple', 'lemon'],
['peach', None]
])
encoder = encoders.OrdinalEncoder(impute_missing=True)
result = encoder.fit_transform(data)
self.assertEqual(2, len(result[0].unique()), "We expect two unique values in the column")
self.assertEqual(2, len(result[1].unique()), "We expect two unique values in the column")
self.assertFalse(np.isnan(result.values[1, 1]))
encoder = encoders.OrdinalEncoder(impute_missing=False)
result = encoder.fit_transform(data)
self.assertEqual(2, len(result[0].unique()), "We expect two unique values in the column")
self.assertEqual(2, len(result[1].unique()), "We expect two unique values in the column")
self.assertTrue(np.isnan(result.values[1, 1])) I do not care much whether we count from 1 or 0. But of course, it is better to be consistent. |
It looks good. In one test I would maybe be more specific: def test_handle_missing_error(self):
non_null = pd.DataFrame({'city': ['chicago', 'los angeles'], 'color':['red', np.nan]}) # only 'city' column is going to be transformed
has_null = pd.DataFrame({'city': ['chicago', np.nan], 'color':['red', np.nan]})
y = pd.Series([1, 0])
# TODO - implement for all encoders
for encoder_name in ['OrdinalEncoder']:
with self.subTest(encoder_name=encoder_name):
enc = getattr(encoders, encoder_name)(handle_missing='error', cols='city')
with self.assertRaises(ValueError):
enc.fit(has_null, y)
enc.fit(non_null, y) # we raise an error only if a missing value is in one of the transformed columns
with self.assertRaises(ValueError):
enc.transform(has_null) |
@janmotl I've been rethinking having settings for The current behavior is for value to add a new column, so I think I only need settings for indicator. Now, I am not sure I have preserved the |
We came to the conclusion that def test_handle_missing_return_nan_train(self):
X = pd.DataFrame({'city': ['chicago', 'los angeles', None]})
y = pd.Series([1, 0, 1])
# TODO - implement for all encoders
for encoder_name in ['OrdinalEncoder']:
with self.subTest(encoder_name=encoder_name):
enc = getattr(encoders, encoder_name)(handle_missing='return_nan')
result = enc.fit_transform(X, y)
self.assertTrue(pd.isna(result['city'][2]))
def test_handle_missing_return_nan_test(self):
X = pd.DataFrame({'city': ['chicago', 'los angeles', 'chicago']})
X_t = pd.DataFrame({'city': ['chicago', 'los angeles', None]})
y = pd.Series([1, 0, 1])
# TODO - implement for all encoders
for encoder_name in ['OrdinalEncoder']:
with self.subTest(encoder_name=encoder_name):
enc = getattr(encoders, encoder_name)(handle_missing='return_nan')
result = enc.fit(X, y).transform(X_t)
self.assertTrue(pd.isna(result['city'][2])) |
Ok yeah, I blanked that. I'm making great progress on Helmert Encoder. For the |
@janmotl I am one test away from having a working test suite for one of the multiple column encoders and I am trying to understand why what the test is asserting to be true is true. The test is
I see that the second value in the data frame is set to nan. Why is this ok behavior? |
I agree that the following part of # when a new value is encountered, do not raise an exception
enc = getattr(encoders, encoder_name)(verbose=1, cols=cols)
enc.fit(X, y)
_ = enc.inverse_transform(enc.transform(X_t_extra)) And the behaviour of I do not see a reason why BinaryEncoder should behave differently from other encoders. HashingEncoder - sure, this one is different. But BinaryEncoder, not so much. |
… encoder changes for the new handle missing and handle unknown
You can check the latest commit to see my changes for the encoders that support multiple columns. The biggest thing I implemented that I am going to put serious thought into doing differently are the places where I put
I am going to find a way to do that differently. Tell me what you think. |
@JohnnyC08 By glancing, it looks good. This month, I am reconstructing my flat and it is taxing me. Hence, I am not able to provide a deep review. Keep going. |
You are right. It looks like Helmert Encoder does not pass: def test_handle_missing_return_nan_fit(self):
X = pd.DataFrame({'city': ['chicago', 'los angeles', None]})
y = pd.Series([1, 0, 1])
# TODO - implement for all encoders
for encoder_name in encoders.__all__:
with self.subTest(encoder_name=encoder_name):
enc = getattr(encoders, encoder_name)(handle_missing='return_nan')
result = enc.fit_transform(X, y)
self.assertTrue(
pd.isna(result.iloc[2]).any()) # Whole row should contain nans. But some encoders may also return an intercept while Ordinal Encoder passes the test. |
… rewrote the binary and basen encoders to generate a mapping before hand
@janmotl Alright we have the error functionality working for all the encoders. I still have a lot of work to do to check various edge cases for all the encoders as well as check up on the inverse encoding. We're getting there 😄 |
ok @janmotl We're continuing on! We now have
I introduced vectorization in the one hot encoder so that we build a contrast matrix and then use I also introduced the ordinal encoder into the WOE and Target Encoder because it makes dealing with unknown values trivial. Tell me what you think. In the Leave One Out Encoder I have an example of how I could do it without the ordinal encoder, but I have to store two boolean series which could hold a lot of memory if the data frame to transform is big. We're getting there and as I refactor the encoders I see a lot of room of abstraction which would make adding new features like this easier. We can look to tackle it some time after this is done. 😄 |
|
…ay the handle missing and value params get inherited
… between python and pandas versions
Ok, I have all the pieces ready except for the inverse transform piece. I also had to update the pandas version from |
Nice work on the code, documentation and tests. Some typos (some of them are old):
|
I've started to tackle the inverse transform piece and some things I thought about. It isn't worth considering if either field, The only encoder that transforms to a single column with inverse transform function defined is the Ordinal Encoder. So here are what I think the rules should be. Given we have the available settings
For the encoders that output multiple columns we have the available settings value, return_nan, and indicator.
Am I missing anything? |
There are two moments when we can raise an exception:
I believe we should always try to execute the inverse transformation because the data can actually be without any missing or new value. In other words, with nice data we can always successfully perform the inverse transformation regardless of the setting of When we encounter a value, which codes a new value, during the method execution, raise an exception. Hence, with each setting but A silly pseudocode illustrating the check in the inverse method for Ordinal Encoder: for row in range(len(X_t)):
if X_t[row, col]==self.handle_unknown_symbol: # e.g.: NaN or -1. The check can be skipped when handle_unknown==error
raise exception
# continue as commonly |
Ok how about
That would be for encoders that share a fill value and we know ordinal encoder will have to have a separate check for However, I don't see anything around the Tell me what you think. |
I think that it is ok when the inverse method in ordinal encoder return NaNs. If nothing else, when
Of course, it is not always possible to know what to return. For example, when both The code for ordinal encoder: if self.handle_unknown == 'value:
if X[col].isin([-1]).any():
raise ValueError('-1 value was found. But it is impossible to recover a value that has never been observed during the encoder fitting.')
if self.handle_missing == 'return_nan' and self.handle_unknown == 'return_nan':
if X[col].isnull().any():
raise ValueError('NaN value was found. But it is unknown whether it represents a missing value or an unknown value.')
# Execute inverse transform The errors could be turned into warnings, which say that a (potential) unknown value (or values) was (were) turned into a NaN (NaNs). But I leave the decision up to you. |
…rse transforms and issue warnings when unknowns exist or the bijection is broken
…s broken and test that we get nulls when bijection is broken
…upport-new-handle-unknown-handle-missing
Alright buddy, I have ensured the inverse transform functions will warn when possible and do best case transformations. That completes everything I can think of that has to do with the new fields. Is there anything else you can think of? Since there are so many changes, I would feel comfortable with some more tests. How is #117 coming along? Is it something we can use before merging this in? Please tell me what you think. |
@JohnnyC08 I will merge the code. And execute the large benchmark in examples with warnings turned on. If there is no (significant) change in the accuracy and if there is no warning, it will be a good sign. Another reason why to merge the code is because I wrote a few new encoders. And they depend on your changes. And then there is the new MultiHotEncoder, which should conform the new convention (the changed tests) from the beginning. #117 is not really going to help here. The name of #117 should have really been "parameterized tests". And that is done. The issue is left open only because of unresolved issue with CircleCI. And that's waiting on @wdm0006. |
A few things to look at:
def test_handle_unknown_in_invariant(self):
encoder = encoders.BackwardDifferenceEncoder()
X_test = X_t
X_test.loc[3, 'invariant'] = 'extra_value'
encoder.fit(X)
_ = encoder.transform(X_test) # Empty mapping DataFrame for invariant feature And in other encoders like HelmertEncoder... But it is possibly an old bug. If fixed, we can just alter |
Here is the first pass at making Ordinal Encoder support the fields
handle_unknown
andhandle_missing
as described at #92.Lets go through the fields and their logic.
handle_unknown
Ok, now
handle_missing
has configurations for each setting depending on if nan is present at fit time.handle_missing
Ok, for a total implementation every encoder will have to be changed. What do we want to do avoiding gigantic Pull Requests? Have a long lived feature branch?
Ok thoughts,
handle_unknown
and .handle_missing
because trying to keep it all straight in my head is difficult.return_nan
make processing in the downstream encoders more difficult because we are mapping nan to -2.test_ordinal_dist
test intest_ordinal
. Why was None not being treated as a category?Tell me what you think and I can get started o the other encoders.