Describe the bug
Hi @thinkall,
I think I may have found an issue in FLAML, although it's possible it was a deliberate choice by the developers. In short, if we use the holdout strategy for classification tasks, we find that: len(input_data) < len(training_data) + len(test_data).
This occurs even when I set auto_augment=False, so up-sampling of data is not the issue here.
Steps to reproduce
If I run a classification task against the Iris dataset, my input dataset has 150 rows.
If I then analyse the automl state afterwards, I can see that we have 135 rows in automl._state.X_train and 18 rows in automl._state.X_val - 153 rows in total, i.e. 3 more rows than we started with. The code to reproduce this is below:
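A minimal reproduction along these lines shows the mismatch; the exact fit settings (time_budget, eval_method, estimator_list) are assumptions on my part, only auto_augment=False matters here:

```python
from sklearn.datasets import load_iris
from flaml import AutoML

# Load the 150-row Iris dataset as a DataFrame/Series pair.
X, y = load_iris(return_X_y=True, as_frame=True)

automl = AutoML()
automl.fit(
    X_train=X,
    y_train=y,
    task="classification",
    estimator_list=["rf"],   # Random Forest, as in the example below
    eval_method="holdout",
    auto_augment=False,      # rule out up-sampling as the cause
    time_budget=10,
)

print(len(X))                      # 150
print(len(automl._state.X_train))  # 135 (value reported above)
print(len(automl._state.X_val))    # 18  -> 135 + 18 = 153 > 150
```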
My colleague @drwillcharles has identified the cause of this issue, which is found in the prepare_data method of flaml/automl/task/generic_task.py: the first row containing each class is extracted from the original training dataset (into X_first), and then, once _train_test_split has been used to split the data, these rows are added back to both the training and test datasets.
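For illustration, a simplified sketch of that pattern (not the actual FLAML source) looks roughly like this:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def holdout_split_with_duplicated_first_rows(df: pd.DataFrame, label: str, split_ratio: float = 0.1):
    # Take the first occurrence of every class out of the data.
    X_first = df.groupby(label).head(1)
    rest = df.drop(X_first.index)

    # Split only the remaining rows.
    train, val = train_test_split(rest, test_size=split_ratio, random_state=0)

    # Concatenate X_first back into *both* splits, so
    # len(train) + len(val) == len(df) + df[label].nunique().
    train = pd.concat([X_first, train])
    val = pd.concat([X_first, val])
    return train, val
```

With Iris (three classes, 10% holdout) this sketch reproduces the 135 + 18 = 153 rows observed above.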
I'm not sure if this was an error, or a deliberate choice by the original developers.
I think the advantage of this code is that it guarantees the training and testing datasets both contain at least one instance of every class. The disadvantage is that the training and test data overlap, which will bias the evaluation.
Possible Solution
Could I please ask your thoughts on this? Personally, if this code is going to be kept, I'd rather it were only applied when required, and not in every case.
Perhaps we could just run _train_test_split on the entire dataset (including X_first) and then only duplicate X_first where necessary (i.e. when either the training or test set doesn't contain one or more of the classes)? A rough sketch of this idea is shown below.
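Something along those lines (a hypothetical helper, not a drop-in patch for generic_task.py):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def holdout_split_patching_missing_classes(df: pd.DataFrame, label: str, split_ratio: float = 0.1):
    # Split the entire dataset, X_first included.
    train, val = train_test_split(df, test_size=split_ratio, random_state=0)

    # First occurrence of every class, used only as a fallback.
    X_first = df.groupby(label).head(1)
    all_classes = set(df[label])

    # Only duplicate a class's first row into a split that is missing that class.
    train = pd.concat([train, X_first[X_first[label].isin(all_classes - set(train[label]))]])
    val = pd.concat([val, X_first[X_first[label].isin(all_classes - set(val[label]))]])
    return train, val
```

In the common case nothing is duplicated and the row counts add up; a class's first row is only copied across when that class would otherwise be missing from one of the splits.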
Please let me know if I'm misunderstanding anything - I welcome your thoughts.
Thanks!
Model Used
Random Forest in this example but this error has been present for all models tested.
Expected Behavior
No response
Screenshots and logs
No response
Additional Information
Python 3.10.3
FLAML 2.3.2
Commenting out / removing these two lines seems to work: the split ratio is maintained and no data is duplicated between train and test. The only downside is that if a label is unique to the training set, it will never be evaluated in the test set. If you are happy with this approach then I am happy to open a PR.
Thank you @dannycg1996, @drwillcharles. For classification, we want to make sure the labels are complete in both training and validation data, thus we'll concat the first instance of each class into both train and val. This is not a bug.
Thank you for the explanation; I can see that this is a design choice rather than a bug.
I do still think that this adds a bit of data leakage to the test set (with the Iris example, 3 of the 18 validation rows also appear in the training set), although this becomes negligible with larger datasets.