
[Bug]: Sometimes the optimal results of non-optimal estimators are not saved #1388

Open
flippercy opened this issue Dec 16, 2024 · 8 comments
Labels
bug Something isn't working

Comments

@flippercy

Describe the bug

Hi:

I've created two customized LightGBM estimators for AutoML:

class MyMonotonicLightGBMGBDTClassifier(BaseEstimator):

    def __init__(self, task = 'binary:logistic', n_jobs = num_cores, **params):
        super().__init__(task, **params)
        self.estimator_class = LGBMClassifier

        # convert to int for integer hyperparameters
        self.params = {
            'n_jobs': params['n_jobs'] if 'n_jobs' in params else num_cores,
            'boosting_type': params['boosting_type'] if 'boosting_type' in params else 'gbdt',
            'colsample_bytree': params['colsample_bytree'],
            'n_estimators': int(params['n_estimators']),
            'random_state': params['random_state'] if 'random_state' in params else randomseed,
            'monotone_constraints': params['monotone_constraints'] if 'monotone_constraints' in params else monotone,
        }

    @classmethod
    def search_space(cls, data_size, task):
        space = {
            'n_estimators': {'domain': tune.uniform(lower = 50, upper = 500), 'init_value': 200, 'low_cost_init_value': 200},
            'colsample_bytree': {'domain': tune.uniform(lower = 0.5, upper = 1), 'init_value': 0.9, 'low_cost_init_value': 0.9},
        }
        return space

automl.add_learner(learner_name = 'MonotonicLightGBMGBDT', learner_class = MyMonotonicLightGBMGBDTClassifier)

class MyMonotonicLightGBMDartClassifier(BaseEstimator):

    def __init__(self, task = 'binary:logistic', n_jobs = num_cores, **params):
        super().__init__(task, **params)
        self.estimator_class = LGBMClassifier

        self.params = {
            'n_jobs': params['n_jobs'] if 'n_jobs' in params else num_cores,
            'boosting_type': params['boosting_type'] if 'boosting_type' in params else 'dart',
            'colsample_bytree': params['colsample_bytree'],
            'n_estimators': int(params['n_estimators']),
            'drop_rate': params['drop_rate'],
            'random_state': params['random_state'] if 'random_state' in params else randomseed,
            'monotone_constraints': params['monotone_constraints'] if 'monotone_constraints' in params else monotone,
        }

    @classmethod
    def search_space(cls, data_size, task):
        space = {
            'n_estimators': {'domain': tune.uniform(lower = 50, upper = 500), 'init_value': 200, 'low_cost_init_value': 200},
            'colsample_bytree': {'domain': tune.uniform(lower = 0.5, upper = 1), 'init_value': 0.9, 'low_cost_init_value': 0.9},
            'drop_rate': {'domain': tune.uniform(lower = 0.1, upper = 0.4), 'init_value': 0.2, 'low_cost_init_value': 0.2},
        }
        return space

automl.add_learner(learner_name = 'MonotonicLightGBMDart', learner_class = MyMonotonicLightGBMDartClassifier)

Then I run AutoML with these two estimators using the settings below:

from flaml import AutoML
from flaml.automl.model import BaseEstimator, LRL1Classifier
from xgboost.sklearn import XGBClassifier
from lightgbm.sklearn import LGBMClassifier

estimator_list= [ 'MonotonicLightGBMDart', 'MonotonicLightGBMGBDT']

settings = {
    "keep_search_state": True,
    "time_budget": flaml_time_budget,
    'max_iter': 15,
    'mem_thres': flaml_mem_thres,
    "metric": 'roc_auc',
    "task": 'classification',
    "estimator_list": estimator_list,
    "log_file_name": logfilename,
    "log_type": 'all',
    "seed": randomseed,
    "model_history": True
}
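
The search itself is then launched by passing these settings to automl.fit. A minimal sketch (the data frame and weight names here are placeholders; the actual ones appear in the full code further below):

automl.fit(
    X_train=X_train, y_train=y_train, sample_weight=w_train,    # training data (placeholders)
    X_val=X_val, y_val=y_val, sample_weight_val=w_val,          # holdout validation data (placeholders)
    **settings,
)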

The process usually runs well; however, I noticed one issue: sometimes the best result of an estimator that is not the overall optimal one is not saved. For example, after the search I want to retrieve the best models of both MonotonicLightGBMDart and MonotonicLightGBMGBDT. If the overall optimal model is built by MonotonicLightGBMDart, then sometimes the best model from MonotonicLightGBMGBDT is not saved (automl.best_model_for_estimator('MonotonicLightGBMGBDT')._model returned an empty model).
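
For reference, the retrieval I am doing is essentially this (a minimal sketch using the learner names registered above):

for name in ['MonotonicLightGBMDart', 'MonotonicLightGBMGBDT']:
    best = automl.best_model_for_estimator(name)
    # best._model is sometimes empty for the learner that did not win overall
    print(name, best, getattr(best, '_model', None))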

What confuses me even more is that it does not happen every time and is not always repeatable. Sometimes, if I restart the kernel and re-run the process, the issue disappears.

Could anyone check my code and tell me what causes this problem?

Thank you.

Steps to reproduce

No response

Model Used

No response

Expected Behavior

No response

Screenshots and logs

No response

Additional Information

No response

@flippercy flippercy added the bug Something isn't working label Dec 16, 2024
@thinkall
Collaborator

Hi @flippercy , thank you for reporting the issue. It happens when one estimator is never trained. Could you check the detailed logs to confirm that?

@flippercy
Author

Hi @thinkall:

Unfortunately, that is not the reason. Based on the logs and the results from other functions (such as automl._search_states.items()), all the estimators have been trained. I can even retrieve the optimal results (such as AUC) for each learner without any issue; the only problem is that the best model itself of a non-optimal learner sometimes cannot be saved. It does not happen every time but just randomly, which makes it harder to troubleshoot.
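
The kind of inspection I mean looks roughly like this (a sketch; _search_states is an internal FLAML attribute, so names such as best_loss may differ between versions):

for name, state in automl._search_states.items():
    # each learner shows a finished search with a recorded best loss ...
    print(name, 'best loss:', state.best_loss)
    # ... yet the wrapped model is sometimes missing for the non-optimal learner
    best = automl.best_model_for_estimator(name)
    print(name, 'fitted model:', getattr(best, '_model', None))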

Is it due to my customized learners? Could you check them quickly please?

Thank you.

@thinkall
Collaborator

Hi @flippercy , I guess it happens when the non-optimal learner was never fully trained. Would you mind sharing the full log history and a code snippet for reproducing it?

@flippercy
Author

Hi @thinkall:

Thank you for the response. Unfortunately, the related log file was deleted; however, I can guarantee that under-training is not the reason for this issue. With the example above, when I ran the search for 100 iterations, the usage ratio between the two estimators was usually about 6:4.

@thinkall
Collaborator

Hi @flippercy , it would be helpful if you could share a complete code snippet for reproducing the issue. Thanks.

@flippercy
Author

flippercy commented Dec 19, 2024

@thinkall:

Thank you for the response. I am afraid that the issue is not easily repeatable because first of all, it happens RANDOMLY. As I said, our current "solution" is simply restart the kernel and rerun the whole process; most of the time the issue will be gone; if not, we will repeat until it disappears......; in addition, I am not sure whether it will happen with default learners. I suspect that probably it is related with my customized learners but cannot find any clue.

No matter what, below are the codes I used:

predictors_to_use_for_FLAML = predictors_i + RawModelingVariables

df = pd.DataFrame(data_dev_balanced_B_WtCor, index=[targetVariable])
monotone_values = (df[predictors_to_use_for_FLAML] / df[predictors_to_use_for_FLAML].abs()).astype(int).values.tolist()
predictors_to_use_for_FLAML_monotone = []
for sublist in monotone_values:
    predictors_to_use_for_FLAML_monotone.extend(sublist)

data_dev_balanced_B_flaml = data_dev_balanced_B.loc[:, IndexVariables + [targetVariable, weightVariable] + predictors_to_use_for_FLAML]
data_dev_balanced_B_flaml[targetVariable] = data_dev_balanced_B_flaml[targetVariable].astype(int)

data_val_balanced_B_flaml = data_val_balanced_B.loc[:, IndexVariables + [targetVariable, weightVariable] + predictors_to_use_for_FLAML]
data_val_balanced_B_flaml[targetVariable] = data_val_balanced_B_flaml[targetVariable].astype(int)

logfilename = outputDir + '/model_result.txt'

flaml_estimator_list= ['MonotonicLightGBMGBDT', 'MonotonicLightGBMDart']

flaml_time_budget = int(3600 * 24 * 3) # seconds
flaml_max_iter = 250
flaml_mem_thres = 1024 * 1024 * 1024 * 60 # bytes
randomNumberSeed = int(randomNumberSeed)

import flaml as flaml
import pickle
import random
import numpy as np
import time as time
import pandas as pd
from flaml import tune
from flaml import AutoML
from flaml.automl.model import BaseEstimator, LRL1Classifier
from xgboost.sklearn import XGBClassifier
from lightgbm.sklearn import LGBMClassifier

automl = AutoML()
print(flaml.__version__)

num_cores = numCores
randomseed = randomNumberSeed

predictors_to_consider_for_FLAML = predictors_to_use_for_FLAML

monotone=tuple(predictors_to_use_for_FLAML_monotone)

data_dev_balanced_B_X=data_dev_balanced_B_flaml[data_dev_balanced_B_flaml.columns.intersection(predictors_to_consider_for_FLAML)]
data_dev_balanced_B_y=data_dev_balanced_B_flaml[targetVariable].values.ravel()
data_dev_balanced_B_w=data_dev_balanced_B_flaml[weightVariable].values.ravel()

data_val_balanced_B_X=data_val_balanced_B_flaml[data_val_balanced_B_flaml.columns.intersection(predictors_to_consider_for_FLAML)]
data_val_balanced_B_y=data_val_balanced_B_flaml[targetVariable].values.ravel()
data_val_balanced_B_w=data_val_balanced_B_flaml[weightVariable].values.ravel()

class MyMonotonicLightGBMGBDTClassifier(BaseEstimator):

    def __init__(self, task = 'binary:logistic', n_jobs = num_cores, **params):
        super().__init__(task, **params)
        self.estimator_class = LGBMClassifier

        # convert to int for integer hyperparameters
        self.params = {
            'n_jobs': params['n_jobs'] if 'n_jobs' in params else num_cores,
            'boosting_type': params['boosting_type'] if 'boosting_type' in params else 'gbdt',
            'colsample_bytree': params['colsample_bytree'],
            'n_estimators': int(params['n_estimators']),
            'random_state': params['random_state'] if 'random_state' in params else randomseed,
            'monotone_constraints': params['monotone_constraints'] if 'monotone_constraints' in params else monotone,
        }

    @classmethod
    def search_space(cls, data_size, task):
        space = {
            'n_estimators': {'domain': tune.uniform(lower = 50, upper = 500), 'init_value': 200, 'low_cost_init_value': 200},
            'colsample_bytree': {'domain': tune.uniform(lower = 0.5, upper = 1), 'init_value': 0.9, 'low_cost_init_value': 0.9},
        }
        return space

automl.add_learner(learner_name = 'MonotonicLightGBMGBDT', learner_class = MyMonotonicLightGBMGBDTClassifier)

class MyMonotonicLightGBMDartClassifier(BaseEstimator):

    def __init__(self, task = 'binary:logistic', n_jobs = num_cores, **params):
        super().__init__(task, **params)
        self.estimator_class = LGBMClassifier

        self.params = {
            'n_jobs': params['n_jobs'] if 'n_jobs' in params else num_cores,
            'boosting_type': params['boosting_type'] if 'boosting_type' in params else 'dart',
            'colsample_bytree': params['colsample_bytree'],
            'n_estimators': int(params['n_estimators']),
            'drop_rate': params['drop_rate'],
            'random_state': params['random_state'] if 'random_state' in params else randomseed,
            'monotone_constraints': params['monotone_constraints'] if 'monotone_constraints' in params else monotone,
        }

    @classmethod
    def search_space(cls, data_size, task):
        space = {
            'n_estimators': {'domain': tune.uniform(lower = 50, upper = 500), 'init_value': 200, 'low_cost_init_value': 200},
            'colsample_bytree': {'domain': tune.uniform(lower = 0.5, upper = 1), 'init_value': 0.9, 'low_cost_init_value': 0.9},
            'drop_rate': {'domain': tune.uniform(lower = 0.1, upper = 0.4), 'init_value': 0.2, 'low_cost_init_value': 0.2},
        }
        return space

automl.add_learner(learner_name = 'MonotonicLightGBMDart', learner_class = MyMonotonicLightGBMDartClassifier)

estimator_list= flaml_estimator_list

settings = {
    "keep_search_state": False,
    "time_budget": flaml_time_budget,
    'max_iter': 15,
    'mem_thres': flaml_mem_thres,
    "metric": 'roc_auc',
    "task": 'classification',
    "estimator_list": estimator_list,
    "log_file_name": logfilename,
    "log_type": 'all',
    "seed": randomseed,
    "model_history": True,
}

automl.fit(X_train = data_dev_balanced_B_X, y_train = data_dev_balanced_B_y, sample_weight = data_dev_balanced_B_w,
           X_val = data_val_balanced_B_X, y_val = data_val_balanced_B_y, sample_weight_val = data_val_balanced_B_w, **settings)

for x in estimator_list:
    automl_best_model = automl.best_model_for_estimator(x)
    if automl_best_model is not None:
        automl_best_model.model.booster_.save_model(dataDir + '/Best_' + x)

@flippercy
Author

model_result.txt

This is the log file from a recent search. After that run, the best model from the learner "MonotonicLightGBMDart" could not be saved.

@thinkall
Collaborator

Hi @flippercy , this is tricky. I'm not sure what the root cause is. But as you've mentioned:

the best model by MonotonicLightGBMGBDT is not saved (returned an empty model when I tried automl.best_model_for_estimator('MonotonicLightGBMGBDT')._model)

You got an empty model (a dummy model?) instead of None. Maybe you've hit a resource limitation. Just some thoughts.

FLAML/flaml/automl/model.py

Lines 278 to 287 in 6d53929

except (MemoryError, TimeoutError) as e:
    logger.warning(f"{e.__class__} {e}")
    if self._task.is_classification():
        model = DummyClassifier()
    else:
        model = DummyRegressor()
    X_train = self._preprocess(X_train)
    model.fit(X_train, y_train)
self._model = model
train_time = time.time() - start_time
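
If that fallback is what happened, it should be detectable from the returned estimator. A quick check along these lines (a sketch; .model is the same attribute used in your saving loop above):

from sklearn.dummy import DummyClassifier

# If fit() hit a MemoryError or TimeoutError, the excerpt above replaces the
# real LightGBM model with a DummyClassifier; this check would expose that case.
est = automl.best_model_for_estimator('MonotonicLightGBMGBDT')
if est is not None and isinstance(est.model, DummyClassifier):
    print('This learner fell back to a dummy model, likely due to mem_thres or the time budget.')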
