Ordinal encoder support new handle unknown handle missing #153

Merged
42 commits
9efba10
Add tests and logic to ensure ordinal encoder supports the handle unk…
JohnnyC08 Oct 29, 2018
8cac21a
Make all encoders support value over impute
JohnnyC08 Oct 29, 2018
c9ff8e0
Add handle missing to ordinal encoder
JohnnyC08 Oct 29, 2018
f940bdb
Add handle na settings return_nan and value for ordinal encoder
JohnnyC08 Nov 2, 2018
4fff393
Remove impute_missing field from ordinal encoder
JohnnyC08 Nov 2, 2018
162097c
Make Ordinal Encoder return -2 at transform time if and only if nan i…
JohnnyC08 Nov 2, 2018
b1c0717
Refactor handle missing error test
JohnnyC08 Nov 2, 2018
ab105b2
In test ordinal dist, check every value
JohnnyC08 Nov 2, 2018
8971f18
Convert encoders that use multi column outputs to support the ordinal…
JohnnyC08 Nov 8, 2018
3526a6d
Make all encoders handle error for handle unknown and handle missing,…
JohnnyC08 Nov 18, 2018
28c81a8
When creating expected dataframes set the column order in tests
JohnnyC08 Nov 18, 2018
374ca54
Added functionality to handle return_nan for train and test in handle…
JohnnyC08 Nov 27, 2018
df9e0ba
Make leave one out encoder respect value settings
JohnnyC08 Dec 12, 2018
fc4917a
Convert backward difference and basen to use value and indicator corr…
JohnnyC08 Dec 21, 2018
f00e77a
Convert binary encoder to internally use the base n encoder so that w…
JohnnyC08 Dec 21, 2018
111b9f8
Have base n with handle unknown indicator
JohnnyC08 Dec 21, 2018
bdc3fe7
Have helmert check value and indicator and make faster
JohnnyC08 Dec 21, 2018
1d3aa4f
Add tests for one hot encoder to ensure correct handle missing and ha…
JohnnyC08 Dec 22, 2018
4e1bb8d
Make polynomial encoder handle unknown and missing
JohnnyC08 Dec 22, 2018
ea2d40d
Make sum encoding handle missing and unknown
JohnnyC08 Dec 22, 2018
c00664b
Add tests to check handle missing and handle unknown for woe
JohnnyC08 Dec 22, 2018
276053c
Merge branch 'master' into ordinal-encoder-support-new-handle-unknown…
JohnnyC08 Dec 22, 2018
e5ab10d
Add override return df
JohnnyC08 Dec 22, 2018
3390a62
Fix problems from detached head
JohnnyC08 Dec 22, 2018
3b1517e
See if leave one out tests are fixed
JohnnyC08 Dec 22, 2018
90fadd7
Update pandas version for issue in regards to isin for empty series
JohnnyC08 Dec 22, 2018
eb1bfe7
In travis yaml use pandas version the package uses
JohnnyC08 Dec 22, 2018
30a6f62
Use loc to update mapping dataframe to eliminate on copy warning
JohnnyC08 Dec 22, 2018
de1af1b
Fixed doctests
JohnnyC08 Dec 22, 2018
a237af4
Fix helmert doc tests
JohnnyC08 Dec 22, 2018
e2af587
Fix base n doc test
JohnnyC08 Dec 22, 2018
f7f4155
Specify column ordering for refit test so we have consistent behavior…
JohnnyC08 Dec 22, 2018
0ace39b
Use deep round on polynomial tests to satisfy python2
JohnnyC08 Dec 22, 2018
a3214d0
Use reindex on one hot to remove warning message
JohnnyC08 Dec 22, 2018
aea80fa
Make leave one out test a decimal for python2
JohnnyC08 Dec 22, 2018
08d96d3
Fix typos
JohnnyC08 Dec 22, 2018
58524d1
replace 'ignore' with 'return_nan' in docs
JohnnyC08 Dec 22, 2018
ccfb4d5
Convert ordinal encoder inverse transform to do best attempts at inve…
JohnnyC08 Dec 29, 2018
dc5ac1b
Update oridinal encoder inverse transform documentation
JohnnyC08 Dec 29, 2018
a4c916b
Convert inverse transform to use warnings for cases where bijection i…
JohnnyC08 Jan 4, 2019
109a3d6
Make test reflect what's in master
JohnnyC08 Jan 4, 2019
a98a8cc
Merge remote-tracking branch 'upstream/master' into ordinal-encoder-s…
JohnnyC08 Jan 4, 2019
4 changes: 2 additions & 2 deletions .travis.yml
@@ -15,10 +15,10 @@ env:
   matrix:
     # The versions should match the minimal requirements in requirements.txt and setup.py
     - DISTRIB="conda" PYTHON_VERSION="2.7" CYTHON_VERSION="0.21"
-      NUMPY_VERSION="1.11.1" PANDAS_VERSION="0.20.1" PATSY_VERSION="0.4.1"
+      NUMPY_VERSION="1.11.1" PANDAS_VERSION="0.21.1" PATSY_VERSION="0.4.1"
       SCIKIT_VERSION="0.17.1" SCIPY_VERSION="0.17.0" STATSMODELS_VERSION="0.6.1"
     - DISTRIB="conda" PYTHON_VERSION="3.5" COVERAGE="true" CYTHON_VERSION="0.23.4"
-      NUMPY_VERSION="1.11.1" PANDAS_VERSION="0.20.1" PATSY_VERSION="0.4.1"
+      NUMPY_VERSION="1.11.1" PANDAS_VERSION="0.21.1" PATSY_VERSION="0.4.1"
       SCIKIT_VERSION="0.17.1" SCIPY_VERSION="0.17.0" STATSMODELS_VERSION="0.6.1"

 install: source ci_scripts/install.sh
86 changes: 60 additions & 26 deletions category_encoders/backward_difference.py
@@ -24,11 +24,13 @@ class BackwardDifferenceEncoder(BaseEstimator, TransformerMixin):
         boolean for whether or not to drop columns with 0 variance.
     return_df: bool
         boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).
-    impute_missing: bool
-        boolean for whether or not to apply the logic for handle_unknown, will be deprecated in the future.
     handle_unknown: str
-        options are 'error', 'ignore' and 'impute', defaults to 'impute', which will impute the category -1. Warning: if
-        impute is used, an extra column will be added in if the transform matrix has unknown categories. This can causes
+        options are 'error', 'return_nan' and 'value', defaults to 'value'. Warning: if value is used,
+        an extra column will be added in if the transform matrix has unknown categories. This can cause
         unexpected changes in dimension in some cases.
+    handle_missing: str
+        options are 'error', 'return_nan', 'value', and 'indicator', defaults to 'indicator'. Warning: if indicator is used,
+        an extra column will be added in if the transform matrix has unknown categories. This can cause
+        unexpected changes in dimension in some cases.

     Example
@@ -82,14 +84,15 @@ class BackwardDifferenceEncoder(BaseEstimator, TransformerMixin):

     """

-    def __init__(self, verbose=0, cols=None, mapping=None, drop_invariant=False, return_df=True, impute_missing=True, handle_unknown='impute'):
+    def __init__(self, verbose=0, cols=None, mapping=None, drop_invariant=False, return_df=True,
+                 handle_unknown='value', handle_missing='value'):
         self.return_df = return_df
         self.drop_invariant = drop_invariant
         self.drop_cols = []
         self.verbose = verbose
         self.mapping = mapping
-        self.impute_missing = impute_missing
         self.handle_unknown = handle_unknown
+        self.handle_missing = handle_missing
         self.cols = cols
         self.ordinal_encoder = None
         self._dim = None
@@ -128,22 +131,28 @@ def fit(self, X, y=None, **kwargs):
         else:
             self.cols = util.convert_cols_to_list(self.cols)

+        if self.handle_missing == 'error':
+            if X[self.cols].isnull().any().bool():
+                raise ValueError('Columns to be encoded can not contain null')
+
         # train an ordinal pre-encoder
         self.ordinal_encoder = OrdinalEncoder(
             verbose=self.verbose,
             cols=self.cols,
-            impute_missing=self.impute_missing,
-            handle_unknown=self.handle_unknown
+            handle_unknown='value',
+            handle_missing='value'
         )
         self.ordinal_encoder = self.ordinal_encoder.fit(X)

         ordinal_mapping = self.ordinal_encoder.category_mapping

         mappings_out = []
         for switch in ordinal_mapping:
-            values = switch.get('mapping').get_values()
-            column_mapping = self.fit_backward_difference_coding(values)
-            mappings_out.append({'col': switch.get('col'), 'mapping': column_mapping, })
+            values = switch.get('mapping')
+            col = switch.get('col')
+
+            column_mapping = self.fit_backward_difference_coding(col, values, self.handle_missing, self.handle_unknown)
+            mappings_out.append({'col': col, 'mapping': column_mapping, })

         self.mapping = mappings_out
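
A note on the null guard added in this hunk: it reduces the selection with `.isnull().any().bool()`. `Series.bool()` only accepts a Series holding exactly one element, so that form seems to assume a single encoded column, and newer pandas releases deprecate the method. The sketch below shows an alternative reduction that works for any number of columns, mirroring the `.any().any()` pattern this PR uses for the unknown-value check in `transform`. The helper name `has_nulls` is hypothetical and not part of the PR.

```python
import pandas as pd


def has_nulls(X, cols):
    """Return True if any column listed in cols contains a missing value."""
    # isnull() gives a boolean frame, the first any() gives one flag per
    # column, and the second any() collapses those flags to a single value.
    return bool(X[cols].isnull().any().any())


single = pd.DataFrame({'a': ['x', None]})
double = pd.DataFrame({'a': ['x', None], 'b': ['u', 'v']})
clean = pd.DataFrame({'a': ['x', 'y'], 'b': ['u', 'v']})

print(has_nulls(single, ['a']))
print(has_nulls(double, ['a', 'b']))
print(has_nulls(clean, ['a', 'b']))
```

With `handle_missing='error'`, a guard like this would raise the same `ValueError` as the hunk above whenever it returns True.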

@@ -180,6 +189,10 @@ def transform(self, X, override_return_df=False):

         """

+        if self.handle_missing == 'error':
+            if X[self.cols].isnull().any().bool():
+                raise ValueError('Columns to be encoded can not contain null')
+
         if self._dim is None:
             raise ValueError('Must train encoder before it can be used to transform data.')

@@ -194,6 +207,11 @@ def transform(self, X, override_return_df=False):
             return X

         X = self.ordinal_encoder.transform(X)
+
+        if self.handle_unknown == 'error':
+            if X[self.cols].isin([-1]).any().any():
+                raise ValueError('Columns to be encoded can not contain new values')
+
         X = self.backward_difference_coding(X, mapping=self.mapping)

         if self.drop_invariant:
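
The unknown-value check above depends on the ordinal pre-encoder emitting a sentinel code. A minimal sketch of that idea follows, assuming `-1` marks categories not seen during fit (as the `isin([-1])` check suggests) and `-2` marks missing values (as the `df.loc[-2]` row later in this diff suggests). The helpers `to_ordinal` and `check_unknown` are hypothetical illustrations, not the library's `OrdinalEncoder`.

```python
import pandas as pd

UNKNOWN = -1  # sentinel tested by the isin([-1]) check in the hunk above
MISSING = -2  # assumed sentinel for missing values


def to_ordinal(series, mapping):
    """Map categories to ordinal codes using the two sentinel values."""
    def encode(value):
        if pd.isnull(value):
            return MISSING
        return mapping.get(value, UNKNOWN)

    return pd.Series([encode(v) for v in series], index=series.index)


def check_unknown(X, cols, handle_unknown):
    """Raise if handle_unknown is 'error' and any unknown code is present."""
    if handle_unknown == 'error' and X[cols].isin([UNKNOWN]).any().any():
        raise ValueError('Columns to be encoded can not contain new values')


mapping = {'red': 1, 'green': 2}  # learned at fit time in this sketch
X = pd.DataFrame({'color': ['red', 'blue', None]})
X['color'] = to_ordinal(X['color'], mapping)

check_unknown(X, ['color'], 'value')  # passes silently
```

Calling `check_unknown(X, ['color'], 'error')` on the same frame raises `ValueError`, because `'blue'` was never seen during fit.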
@@ -206,14 +224,32 @@ def transform(self, X, override_return_df=False):
             return X.values

     @staticmethod
-    def fit_backward_difference_coding(values):
+    def fit_backward_difference_coding(col, values, handle_missing, handle_unknown):
+        if handle_missing == 'value':
+            values = values[values > 0]
+
+        values_to_encode = values.get_values()
+
         if len(values) < 2:
-            return pd.DataFrame()
+            return pd.DataFrame(index=values_to_encode)
+
+        if handle_unknown == 'indicator':
+            values_to_encode = np.append(values_to_encode, -1)
+
+        backwards_difference_matrix = Diff().code_without_intercept(values_to_encode)
+        df = pd.DataFrame(data=backwards_difference_matrix.matrix, index=values_to_encode,
+                          columns=[str(col) + '_%d' % (i, ) for i in range(len(backwards_difference_matrix.column_suffixes))])
+
+        if handle_unknown == 'return_nan':
+            df.loc[-1] = np.nan
+        elif handle_unknown == 'value':
+            df.loc[-1] = np.zeros(len(values_to_encode) - 1)
+
+        if handle_missing == 'return_nan':
+            df.loc[values.loc[np.nan]] = np.nan
+        elif handle_missing == 'value':
+            df.loc[-2] = np.zeros(len(values_to_encode) - 1)

-        backwards_difference_matrix = Diff().code_without_intercept(values)
-        df = pd.DataFrame(data=backwards_difference_matrix.matrix, columns=backwards_difference_matrix.column_suffixes)
-        df.index += 1
-        df.loc[0] = np.zeros(len(values) - 1)
         return df

     @staticmethod
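
To make the shape of the mapping built by `fit_backward_difference_coding` concrete, here is a rough sketch of a contrast table indexed by ordinal code, with all-zero rows appended for the `-1` (unknown) and `-2` (missing) codes as in the `'value'` branches above. It swaps patsy's `Diff().code_without_intercept` for a small hand-written NumPy helper, so `backward_difference_matrix` and `build_mapping` are illustrative names only and the matrix layout is an assumption about the usual backward-difference convention.

```python
import numpy as np
import pandas as pd


def backward_difference_matrix(n_levels):
    """Build an (n_levels x n_levels-1) backward-difference contrast matrix."""
    contrast = np.zeros((n_levels, n_levels - 1))
    for j in range(1, n_levels):
        # Column j compares levels up to j against the levels after j.
        contrast[:j, j - 1] = (j - n_levels) / float(n_levels)
        contrast[j:, j - 1] = j / float(n_levels)
    return contrast


def build_mapping(col, ordinals):
    """Index the contrast rows by ordinal code and add the sentinel rows."""
    matrix = backward_difference_matrix(len(ordinals))
    columns = ['%s_%d' % (col, i) for i in range(matrix.shape[1])]
    df = pd.DataFrame(matrix, index=ordinals, columns=columns)
    df.loc[-1] = 0.0  # unknown categories encode to all zeros
    df.loc[-2] = 0.0  # missing values encode to all zeros
    return df


mapping = build_mapping('color', [1, 2, 3, 4])
print(mapping)
```

Indexing the table by ordinal code is what lets `transform` look up a whole row per observation instead of patching the index after the fact, as the removed `df.index += 1` line did.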
@@ -230,19 +266,17 @@ def backward_difference_coding(X_in, mapping):
         for switch in mapping:
             col = switch.get('col')
             mod = switch.get('mapping')
-            new_columns = []
-            for i in range(len(mod.columns)):
-                c = mod.columns[i]
-                new_col = str(col) + '_%d' % (i, )
-                X[new_col] = mod[c].loc[X[col]].values
-                new_columns.append(new_col)
+
+            base_df = mod.loc[X[col]]
+            base_df.set_index(X.index, inplace=True)
+            X = pd.concat([base_df, X], axis=1)
+
             old_column_index = cols.index(col)
-            cols[old_column_index: old_column_index + 1] = new_columns
+            cols[old_column_index: old_column_index + 1] = mod.columns

         cols = ['intercept'] + cols
-        X = X.reindex(columns=cols)

-        return X
+        return X.reindex(columns=cols)

     def get_feature_names(self):
         """
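
The rewritten `backward_difference_coding` replaces the per-column loop with a single row lookup, an index realignment, and a `reindex`. A simplified sketch of that pattern, assuming a mapping already indexed by ordinal code; `expand_column` is a hypothetical helper and it leaves out the `intercept` column that the real method prepends.

```python
import pandas as pd


def expand_column(X, col, mapping):
    """Replace ordinal column col with the contrast columns in mapping."""
    # Look up one contrast row per observation, keyed by ordinal code.
    encoded = mapping.loc[X[col].values]
    # The lookup result is indexed by code, so realign it with the input rows.
    encoded.index = X.index
    # Put the new columns where the original column used to be.
    order = list(X.columns)
    position = order.index(col)
    order[position:position + 1] = list(mapping.columns)
    # reindex keeps only the listed columns, which drops the original col.
    return pd.concat([encoded, X], axis=1).reindex(columns=order)


mapping = pd.DataFrame({'color_0': [-0.5, 0.5, 0.0]}, index=[1, 2, -1])
X = pd.DataFrame({'color': [2, 1, -1], 'size': [10, 20, 30]},
                 index=['a', 'b', 'c'])
result = expand_column(X, 'color', mapping)
print(result)
```

Because `reindex` selects only the columns named in the final order, the original ordinal column disappears without an explicit `drop`.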