Feature Request: Add PandasCategoricalEncoder to encode categorical features as pandas categorical #828
Comments
Hi @ClaudioSalvatoreArcidiacono, our encoders try to handle pandas categorical variables within their functionality. They should be able to take variables of type object and of type categorical simultaneously, and for those of type categorical we have some functionality to make it work (i.e., adding categories from the train set to the test set to ensure compatibility). Did you test our encoders with a dataset where they did not work?
Hey @solegalli, This encoder is very similar to I tried to add this feature as an extra option for
Hi @ClaudioSalvatoreArcidiacono, sorry for coming back late to this issue. I am not sure I understand the need for this transformer; maybe you can help me with this. The idea is to have an encoder that transforms categorical variables into numbers while retaining the categorical format, is that right? It would take categorical variables as input and return categorical variables as output, but with numeric content. Is this correct? Would this transformer also accept object variables as input, and would we want to convert them to categorical after the transformation? Or is it exclusive to categorical input variables? And perhaps more importantly: this transformer would be useful because LightGBM handles categorical formats out of the box, correct? I imagine that LightGBM doesn't care whether the numbers are ints or categorical, or does it handle them differently? In other words, why would we prefer to pass the categorical format to LightGBM instead of integers if the content of the variable is numbers?
Some libraries like LightGBM are well integrated with pandas categorical types. I could not find a nice implementation to encode categorical features as pandas categorical columns while preserving the categories across different datasets. I would like to propose the addition of a PandasCategoricalEncoder to the feature_engine library to address this issue.
Is your feature request related to a problem? Please describe.
Yes, I often encounter issues when working with categorical data in pandas. The current
methods do not ensure consistent encoding across different datasets, leading to
potential errors.
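To make the problem concrete, here is a minimal sketch in plain pandas (the example data and variable names are made up for illustration). Casting each dataset to `category` independently infers the categories from that dataset alone, so the same label can end up with a different integer code in train and test. Reusing the dtype learned on the training data is one way to keep the codes aligned.

```python
import pandas as pd

# Toy data, invented for this illustration.
train = pd.Series(["red", "green", "blue"])
test = pd.Series(["green", "red"])

# Each cast infers its own (sorted) categories from the data it sees.
train_cat = train.astype("category")
test_cat = test.astype("category")

# Map each label to the integer code it received in each dataset.
train_codes = dict(zip(train_cat, train_cat.cat.codes))
test_codes = dict(zip(test_cat, test_cat.cat.codes))
print(train_codes["green"], test_codes["green"])  # the codes differ

# Reusing the dtype learned on the training data keeps the codes aligned.
test_fixed = test.astype(train_cat.dtype)
print(test_fixed.cat.codes.tolist())
```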
Describe the solution you'd like
I would like to implement the PandasCategoricalEncoder class, which will transform categorical features into pandas categorical types. This encoder will ensure that categories are encoded consistently between training and testing datasets, and it will handle unseen categories gracefully based on specified parameters.
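A rough sketch of how such an encoder could behave is shown below. This is not feature_engine code: the class name `PandasCategoricalEncoderSketch` and the `variables` and `unseen` parameters are assumptions made up for this illustration. The idea is to learn one `pd.CategoricalDtype` per column in `fit` and reuse it in `transform`.

```python
import pandas as pd


class PandasCategoricalEncoderSketch:
    """Hypothetical sketch, not part of feature_engine."""

    def __init__(self, variables, unseen="ignore"):
        # `unseen` is an assumed parameter: "ignore" maps unseen labels to NaN,
        # "raise" fails loudly when transform meets a label not seen in fit.
        if unseen not in ("ignore", "raise"):
            raise ValueError("unseen must be 'ignore' or 'raise'")
        self.variables = list(variables)
        self.unseen = unseen

    def fit(self, X, y=None):
        # Learn the categories of each variable from the training data only.
        self.dtypes_ = {
            col: pd.CategoricalDtype(categories=list(X[col].dropna().unique()))
            for col in self.variables
        }
        return self

    def transform(self, X):
        X = X.copy()
        for col, dtype in self.dtypes_.items():
            if self.unseen == "raise":
                new_labels = set(X[col].dropna().unique()) - set(dtype.categories)
                if new_labels:
                    raise ValueError(f"Unseen categories in {col!r}: {new_labels}")
            # Labels outside the learned categories become NaN (code -1).
            X[col] = X[col].astype(dtype)
        return X


# Toy data, invented for this illustration.
train = pd.DataFrame({"color": ["red", "green", "blue"]})
test = pd.DataFrame({"color": ["green", "purple"]})

encoder = PandasCategoricalEncoderSketch(variables=["color"]).fit(train)
encoded = encoder.transform(test)
print(encoded["color"].cat.codes.tolist())
```

Because the dtype is fixed at fit time, "green" keeps the code it had in the training data, while the unseen "purple" is mapped to a missing value instead of silently receiving a new code.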
Describe alternatives you've considered
I have considered using existing categorical encoding libraries, but they do not provide such a feature.
Additional context
The PandasCategoricalEncoder will include features such as handling missing values, allowing for flexible unseen category management, and providing methods for inverse transformation to retrieve original values. This will enhance the usability and reliability of categorical data processing in pandas.
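As a small illustration of the missing-value and inverse-transformation behaviour, again in plain pandas with made-up data: with a fixed `pd.CategoricalDtype`, missing values are kept as missing (code -1), and casting back to `object` recovers the original labels. How the proposed encoder would actually expose this is an open design question; the sketch only shows what pandas offers out of the box.

```python
import pandas as pd

# Categories assumed to have been learned from training data.
dtype = pd.CategoricalDtype(categories=["low", "medium", "high"])

original = pd.Series(["low", None, "high", "medium"], dtype=object)

# Encode: the missing value stays missing and receives the code -1.
encoded = original.astype(dtype)
codes = encoded.cat.codes.tolist()
print(codes)

# Inverse transformation: cast back to object to recover the labels.
restored = encoded.astype(object)
print(restored.tolist())
```

One caveat: labels that were unseen at fit time become missing during encoding, so they cannot be recovered by this round trip.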