[Bug]: Training and Test Set Overlap For Classification Tasks with Holdout Strategy #1390

Open
dannycg1996 opened this issue Dec 20, 2024 · 3 comments
dannycg1996 (Collaborator) commented Dec 20, 2024

Describe the bug

Hi @thinkall,

I think I may have found an issue in FLAML, although it's possible it was a deliberate choice by the developers. If we use the holdout strategy for classification tasks, we find that:
len(input_data) < len(training_data) + len(test_data).

This occurs even when I set auto_augment=False, so up-sampling of data is not the issue here.

Steps to reproduce

If I run a classification task on the Iris dataset, my input dataset has 150 rows.
If I then inspect the automl state afterwards, I can see 135 rows in automl._state.X_train and 18 rows in automl._state.X_val - 153 rows in total, i.e. 3 more than we started with. The code to reproduce this is below:

from flaml import AutoML
from sklearn import datasets
import numpy as np

dic_data = datasets.load_iris(as_frame=True)  # Bunch containing pandas objects
iris_data = dic_data["frame"]  # pandas DataFrame with features + target
rng = np.random.default_rng(42)
iris_data["cluster"] = rng.integers(
    low=0, high=5, size=iris_data.shape[0]
)
print(iris_data["cluster"])
print('shape at start', iris_data.shape)
automl = AutoML()
automl_settings = {
    "max_iter": 5,
    "metric": "accuracy",
    "task": "classification",
    "log_file_name": "holdout_test.log",
    "log_type": "all",
    "estimator_list": ["lgbm"],
    "eval_method": "holdout",
    "split_type": "stratified",
    "keep_search_state": True,
    "retrain_full": True,
    "auto_augment": False,
}
x_train = iris_data[["sepal length (cm)","sepal width (cm)", "petal length (cm)","petal width (cm)"]].to_numpy()
y_train = iris_data['target']
automl.fit(x_train, y_train, **automl_settings)
print(len(automl._state.X_train), len(automl._state.X_train_all), len(automl._state.X_val))
print(len(automl._state.y_train), len(automl._state.y_train_all), len(automl._state.y_val))

My colleague @drwillcharles has identified the cause of this issue, which lies in the prepare_data method in flaml/automl/task/generic_task.py:

                X_train, X_val, y_train, y_val = self._train_test_split(
                    state, X_rest, y_rest, first, rest, split_ratio, stratify
                )
                X_train = concat(X_first, X_train)
                y_train = concat(label_set, y_train) if data_is_df else np.concatenate([label_set, y_train])
                X_val = concat(X_first, X_val)
                y_val = concat(label_set, y_val) if data_is_df else np.concatenate([label_set, y_val])

Here the first row of each class has been extracted from the original training dataset (into X_first), and once _train_test_split has split the remaining data, those rows are added back to both the training and validation datasets.

I'm not sure if this was an error, or a deliberate choice by the original developers.
I think that the advantage of this code would be that you guarantee that the training and testing datasets both contain at least one instance of every class. The disadvantage of this is that you have an overlap of training and test data, which will bias the models.
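
To make the mechanism concrete, here is a minimal standalone sketch (my own reconstruction, not FLAML's code) that mimics what prepare_data does on Iris and reproduces the 153-row total shown above:

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

frame = load_iris(as_frame=True)["frame"]  # 150 rows, 3 classes
X, y = frame.drop(columns="target"), frame["target"]

# Mimic prepare_data: pull out the first row of each class ...
first_idx = y.drop_duplicates().index  # rows 0, 50, 100 for Iris
X_first = X.loc[first_idx]
X_rest, y_rest = X.drop(first_idx), y.drop(first_idx)

# ... split only the remaining 147 rows (10% holdout, stratified) ...
X_tr, X_val, y_tr, y_val = train_test_split(
    X_rest, y_rest, test_size=0.1, stratify=y_rest, random_state=0
)

# ... then concat X_first back into BOTH splits, as generic_task.py does.
X_tr, X_val = pd.concat([X_first, X_tr]), pd.concat([X_first, X_val])

print(len(X), len(X_tr), len(X_val), len(X_tr) + len(X_val))  # 150 135 18 153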

Possible Solution

Could I please ask your thoughts on this? Personally, if this code is to be kept, I'd rather it were applied only when it is required, and not in every case.

Perhaps we could run _train_test_split on the entire dataset (including X_first) and then only duplicate X_first where necessary (i.e. when the training or test set is missing one of the classes)? A rough sketch of this is below.
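
Something along these lines, perhaps (an illustrative sketch using sklearn's train_test_split rather than FLAML's internals; the function name and signature are made up):

import pandas as pd
from sklearn.model_selection import train_test_split

def split_with_label_coverage(X, y, split_ratio=0.1, random_state=0):
    """Split the full dataset first, then copy the first row of a class into
    whichever side is missing it, instead of always duplicating every class."""
    # NOTE: stratify fails if any class has a single sample; a real
    # implementation would need a fallback (e.g. an unstratified split).
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=split_ratio, stratify=y, random_state=random_state
    )
    for label in y.unique():
        first_idx = y[y == label].index[:1]  # first row of this class
        if label not in set(y_tr):  # class missing from train: copy it in
            X_tr = pd.concat([X.loc[first_idx], X_tr])
            y_tr = pd.concat([y.loc[first_idx], y_tr])
        if label not in set(y_val):  # class missing from val: copy it in
            X_val = pd.concat([X.loc[first_idx], X_val])
            y_val = pd.concat([y.loc[first_idx], y_val])
    return X_tr, X_val, y_tr, y_val

On Iris with a stratified split, neither branch would ever fire, so the training and validation sets stay disjoint; duplication only happens for genuinely rare classes.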

Please let me know if I'm misunderstanding anything - I welcome your thoughts.
Thanks!

Model Used

LGBM ('lgbm') in this example, but this behaviour has been present for all models tested.

Expected Behavior

No response

Screenshots and logs

No response

Additional Information

Python 3.10.3
FLAML 2.3.2

dannycg1996 added the bug (Something isn't working) label on Dec 20, 2024
drwillcharles commented Dec 20, 2024

Commenting out / removing these two lines seems to work and ensures the split ratio is maintained.
[Screenshot: the two lines in prepare_data that concat X_first into X_val and y_val]

It ensures that data is not duplicated across train and test. The only downside is that if a label appears only in the train dataset (e.g. a class with a single instance), it will never be evaluated in the test dataset. If you are happy with this approach then I am happy to open a PR. The change is sketched below.
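
For clarity, the change presumably amounts to dropping the two validation-side concatenations from the prepare_data fragment quoted earlier, roughly:

X_train, X_val, y_train, y_val = self._train_test_split(
    state, X_rest, y_rest, first, rest, split_ratio, stratify
)
X_train = concat(X_first, X_train)
y_train = concat(label_set, y_train) if data_is_df else np.concatenate([label_set, y_train])
# The two lines below would be removed, so X_first is no longer copied
# into the validation split:
# X_val = concat(X_first, X_val)
# y_val = concat(label_set, y_val) if data_is_df else np.concatenate([label_set, y_val])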

thinkall (Collaborator) commented Dec 21, 2024

Thank you @dannycg1996, @drwillcharles. For classification, we want to make sure the labels are complete in both training and validation data, thus we'll concat the first instance of each class into both train and val. This is not a bug.

drwillcharles commented:


Thank you for the explanation and I can see that this is a design choice rather than a bug.

I do still think this adds a bit of data leakage to the test set, although it becomes negligible with larger datasets.
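
For a rough sense of scale: the overlap is one duplicated row per class, so the leaked fraction of the validation set is roughly n_classes / n_val.

# Back-of-the-envelope, using the numbers from this issue and an
# illustrative larger dataset.
print(3 / 18)        # Iris: 3 classes, 18 validation rows -> ~17% overlap
print(10 / 10_000)   # e.g. 10 classes, 10,000 validation rows -> 0.1%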
