[Issue]: How does FLAML handle missing values #1358

Open
lizhuoq opened this issue Sep 22, 2024 · 1 comment
Labels
documentation Improvements or additions to documentation help wanted Extra attention is needed

Comments


lizhuoq commented Sep 22, 2024

I looked through the FLAML documentation and could not find how FLAML handles missing values in regression and classification tasks for the different estimators. The documentation should describe, for each learning algorithm and task, how FLAML handles missing values in categorical and continuous variables. This would be very helpful, thank you!

@lizhuoq lizhuoq changed the title [Issue]: FLAML [Issue]: How does FLAML handle missing values Sep 22, 2024
@thinkall thinkall added documentation Improvements or additions to documentation help wanted Extra attention is needed labels Oct 24, 2024
Collaborator

dannycg1996 commented Nov 4, 2024

Hi @lizhuoq, FLAML doesn't appear to do any preprocessing to handle missing values - it leaves this to the estimators themselves.

To test this, I applied an LRL1 estimator to the Titanic Dataset (which contains missing data) - the following error was raised:
[Image: screenshot of the error]

Some estimators can't handle missing values, whilst others (like CatBoost) can. The code I used to generate the above error is below. If we change the estimator setting to "estimator_list": ['catboost'], no error is raised.

import seaborn as sns
import pandas as pd
from flaml import AutoML

# Load the Titanic dataset (it contains missing values, e.g. in "age")
titanic_df = sns.load_dataset('titanic')
titanic_df = titanic_df.drop(columns=["deck"])

# Split into features and target, converted to NumPy arrays
X_train = titanic_df.drop(columns=['survived']).to_numpy()
y_train = pd.DataFrame(titanic_df['survived']).to_numpy()

automl_settings = {
    "time_budget": 20,  # in seconds
    "metric": 'accuracy',
    "estimator_list": ['lrl1'],  # swap for ['catboost'] to avoid the error
    "task": 'classification',
    "log_file_name": "titanic_test.log",
    "n_splits": 10,
    "split_type": 'uniform'
}

automl = AutoML()
automl.fit(X_train, y_train, **automl_settings)
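
Since missing values appear to be left to the estimators, one possible workaround for estimators that reject them is to impute before calling fit. The sketch below is only an illustration, not something FLAML does internally: the helper name impute_missing, the "missing" placeholder string, and the toy data are all assumptions made for this example. It fills numeric columns with the column median and all other columns with a placeholder category.

```python
import numpy as np
import pandas as pd


def impute_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: return a copy of df with missing values filled.

    Numeric columns get the column median; all other columns get the
    placeholder string "missing". This is one simple strategy among many.
    """
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            # Cast to object first so categorical columns accept a new value.
            out[col] = out[col].astype(object).fillna("missing")
    return out


# Small stand-in for the Titanic data (made-up values, for illustration only).
toy_df = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "fare": [7.25, 71.28, np.nan, 53.10],
    "embarked": ["S", "C", None, "S"],
})

print(toy_df.isna().sum())           # per-column count of missing values
clean_df = impute_missing(toy_df)
print(clean_df.isna().sum().sum())   # total missing values after imputation
```

The imputed frame could then be used in place of the raw data when building X_train. Whether median imputation is appropriate, and whether an estimator needs further encoding of the categorical columns, depends on the dataset and the estimator.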

I hope that helps!
