[Bug]: Training and Test Set Overlap For Classification Tasks with Holdout Strategy #1390

Open
dannycg1996 opened this issue Dec 20, 2024 · 3 comments
dannycg1996 (Collaborator) commented Dec 20, 2024

Describe the bug

Hi @thinkall,

I think I may have found an issue in FLAML, although it's possible it was a deliberate choice by the developers. If we use the holdout strategy for classification tasks, we find that:
len(input_data) < len(training_data) + len(test_data).

This occurs even when I set auto_augment=False, so up-sampling of data is not the issue here.

Steps to reproduce

If I run a classification task on the Iris dataset, my input dataset has 150 rows.
If I then inspect the automl state afterwards, I can see 135 rows in automl._state.X_train and 18 rows in automl._state.X_val - 153 rows in total, i.e. 3 more than we started with. The code to reproduce this is below:

from flaml import AutoML
from sklearn import datasets
import numpy as np

dic_data = datasets.load_iris(as_frame=True)  # Bunch containing pandas objects
iris_data = dic_data["frame"]  # pandas DataFrame with features + target
rng = np.random.default_rng(42)
iris_data["cluster"] = rng.integers(
    low=0, high=5, size=iris_data.shape[0]
)
print(iris_data["cluster"])
print('shape at start', iris_data.shape)
automl = AutoML()
automl_settings = {
    "max_iter": 5,
    "metric": "accuracy",
    "task": "classification",
    "log_file_name": "holdout_test.log",
    "log_type": "all",
    "estimator_list": ["lgbm"],
    "eval_method": "holdout",
    "split_type": "stratified",
    "keep_search_state": True,
    "retrain_full": True,
    "auto_augment": False,
}
x_train = iris_data[["sepal length (cm)","sepal width (cm)", "petal length (cm)","petal width (cm)"]].to_numpy()
y_train = iris_data['target']
automl.fit(x_train, y_train, **automl_settings)
print(len(automl._state.X_train), len(automl._state.X_train_all), len(automl._state.X_val))
print(len(automl._state.y_train), len(automl._state.y_train_all), len(automl._state.y_val))

My colleague @drwillcharles has identified the cause of this issue, which lies in the prepare_data method in flaml/automl/task/generic_task.py:

                X_train, X_val, y_train, y_val = self._train_test_split(
                    state, X_rest, y_rest, first, rest, split_ratio, stratify
                )
                X_train = concat(X_first, X_train)
                y_train = concat(label_set, y_train) if data_is_df else np.concatenate([label_set, y_train])
                X_val = concat(X_first, X_val)
                y_val = concat(label_set, y_val) if data_is_df else np.concatenate([label_set, y_val])

Here the first row of each class has been extracted from the original training dataset (into X_first), and once _train_test_split has split the remaining data, those rows are added back to both the training and validation datasets.

I'm not sure if this was an error, or a deliberate choice by the original developers.
I think that the advantage of this code would be that you guarantee that the training and testing datasets both contain at least one instance of every class. The disadvantage of this is that you have an overlap of training and test data, which will bias the models.
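
To make the mechanism concrete, here is a minimal standalone sketch (my own reconstruction, not FLAML's code) that mimics what prepare_data does on Iris and reproduces the 153-row total shown above:

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

frame = load_iris(as_frame=True)["frame"]  # 150 rows, 3 classes
X, y = frame.drop(columns="target"), frame["target"]

# Mimic prepare_data: pull out the first row of each class ...
first_idx = y.drop_duplicates().index  # rows 0, 50, 100 for Iris
X_first = X.loc[first_idx]
X_rest, y_rest = X.drop(first_idx), y.drop(first_idx)

# ... split only the remaining 147 rows (10% holdout, stratified) ...
X_tr, X_val, y_tr, y_val = train_test_split(
    X_rest, y_rest, test_size=0.1, stratify=y_rest, random_state=0
)

# ... then concat X_first back into BOTH splits, as generic_task.py does.
X_tr, X_val = pd.concat([X_first, X_tr]), pd.concat([X_first, X_val])

print(len(X), len(X_tr), len(X_val), len(X_tr) + len(X_val))  # 150 135 18 153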

Possible Solution

Could I please ask your thoughts on this? Personally, if this code is to be kept, I'd rather it were applied only when it is required, and not in every case.

Perhaps we could run _train_test_split on the entire dataset (including X_first) and then only duplicate X_first where necessary (i.e. when the training or test set is missing one of the classes)? A rough sketch of this is below.
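
Something along these lines, perhaps (an illustrative sketch using sklearn's train_test_split rather than FLAML's internals; the function name and signature are made up):

import pandas as pd
from sklearn.model_selection import train_test_split

def split_with_label_coverage(X, y, split_ratio=0.1, random_state=0):
    """Split the full dataset first, then copy the first row of a class into
    whichever side is missing it, instead of always duplicating every class."""
    # NOTE: stratify fails if any class has a single sample; a real
    # implementation would need a fallback (e.g. an unstratified split).
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=split_ratio, stratify=y, random_state=random_state
    )
    for label in y.unique():
        first_idx = y[y == label].index[:1]  # first row of this class
        if label not in set(y_tr):  # class missing from train: copy it in
            X_tr = pd.concat([X.loc[first_idx], X_tr])
            y_tr = pd.concat([y.loc[first_idx], y_tr])
        if label not in set(y_val):  # class missing from val: copy it in
            X_val = pd.concat([X.loc[first_idx], X_val])
            y_val = pd.concat([y.loc[first_idx], y_val])
    return X_tr, X_val, y_tr, y_val

On Iris with a stratified split, neither branch would ever fire, so the training and validation sets stay disjoint; duplication only happens for genuinely rare classes.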

Please let me know if I'm misunderstanding anything - I welcome your thoughts.
Thanks!

Model Used

LGBM ('lgbm') in this example, but this behaviour has been present for all models tested.

Expected Behavior

No response

Screenshots and logs

No response

Additional Information

Python 3.10.3
FLAML 2.3.2

dannycg1996 added the bug (Something isn't working) label on Dec 20, 2024
drwillcharles commented Dec 20, 2024

Commenting out / removing these two lines seems to work and ensures the split ratio is maintained.
[Screenshot: the two lines in prepare_data that concat X_first into X_val and y_val]

It ensures that data is not duplicated across train and test. The only downside is that if a label appears only in the train dataset (e.g. a class with a single instance), it will never be evaluated in the test dataset. If you are happy with this approach then I am happy to open a PR. The change is sketched below.
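
For clarity, the change presumably amounts to dropping the two validation-side concatenations from the prepare_data fragment quoted earlier, roughly:

X_train, X_val, y_train, y_val = self._train_test_split(
    state, X_rest, y_rest, first, rest, split_ratio, stratify
)
X_train = concat(X_first, X_train)
y_train = concat(label_set, y_train) if data_is_df else np.concatenate([label_set, y_train])
# The two lines below would be removed, so X_first is no longer copied
# into the validation split:
# X_val = concat(X_first, X_val)
# y_val = concat(label_set, y_val) if data_is_df else np.concatenate([label_set, y_val])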

thinkall (Collaborator) commented Dec 21, 2024

Thank you @dannycg1996, @drwillcharles. For classification, we want to make sure the labels are complete in both training and validation data, thus we'll concat the first instance of each class into both train and val. This is not a bug.

drwillcharles commented:


Thank you for the explanation and I can see that this is a design choice rather than a bug.

I do still think this adds a bit of data leakage to the test set, although it becomes negligible with larger datasets.
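
For a rough sense of scale: the overlap is one duplicated row per class, so the leaked fraction of the validation set is roughly n_classes / n_val.

# Back-of-the-envelope, using the numbers from this issue and an
# illustrative larger dataset.
print(3 / 18)        # Iris: 3 classes, 18 validation rows -> ~17% overlap
print(10 / 10_000)   # e.g. 10 classes, 10,000 validation rows -> 0.1%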
