[DOC] User warning over sampling methods #1101

lcrmorin · 2024-10-12T14:23:19Z

Describe the issue linked to the documentation

There is an ongoing discussion about the usefulness of some (if not all) of the over- and under-sampling methods implemented in the imbalanced-learn package.

Typically there is some doubt about the usefulness of SMOTE:

from researchers (To SMOTE, or not to SMOTE?)
from practitioners (see the weekly discussions on Kaggle, Data Science Stack Exchange, etc.)
and even one of the authors of the package (Learning from Imbalanced data: I was wrong but I was not the only one)

Basically it seems that:

Methods do not improve ranking (think AUC)
Methods do break probability calibration (ECE / calibration curve)
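
One way to see how both observations can hold at once is an idealized prior-shift argument. This is an illustration under stated assumptions, not something taken from the references above: suppose resampling only changes the class proportions, and the model recovers the posterior of the resampled data. Then the new score is a monotone transformation of the original probability, so the ranking is unchanged while the probabilities no longer match the original class frequencies. SMOTE also synthesizes new points, so this only approximates what happens in practice. The function names below are hypothetical.

```python
def resampled_probability(p, prior, new_prior):
    """Posterior P(y=1 | x) after the positive-class prior moves from
    `prior` to `new_prior`, assuming the class-conditional distributions
    are left untouched (idealized resampling)."""
    odds = p / (1.0 - p)
    # Changing the prior multiplies the posterior odds by the ratio of prior odds.
    shift = (new_prior / (1.0 - new_prior)) / (prior / (1.0 - prior))
    new_odds = odds * shift
    return new_odds / (1.0 + new_odds)


def undo_resampling(p_resampled, prior, new_prior):
    """Map a probability learned on resampled data back to the original prior."""
    return resampled_probability(p_resampled, new_prior, prior)


# Calibrated probabilities on data with 10% positives (illustrative values).
original = [0.01, 0.05, 0.10, 0.30, 0.60]
# What an ideal model would output after balancing the classes to 50/50.
shifted = [resampled_probability(p, prior=0.1, new_prior=0.5) for p in original]

for p, q in zip(original, shifted):
    print(f"original={p:.2f}  after balancing={q:.3f}")
```

The order of the scores is preserved, so a rank-based metric such as AUC would not change, but every probability is inflated relative to the original 10% base rate.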

I think it is a problem that these discussions are not more visible to newcomers (and that more experienced people have to deal with this on a weekly basis).

Suggest a potential alternative/fix

It would be nice to have

a clearer demonstration in the doc, because for the moment only the usage is described:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_classes=2, class_sep=2,
    weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0,
    n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)
print('Original dataset shape %s' % Counter(y))
sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X, y)
print('Resampled dataset shape %s' % Counter(y_res))

It shows that the data was oversampled, but not that the method works, either in terms of ranking (AUC) or probability calibration (ECE / calibration curve).

Could the doc be upgraded with a better example?

a visible user warning regarding the discussions on usefulness of these methods.

While one of the authors has changed his mind about the usefulness of these methods, it seems that a younger crowd is still very eager to jump on these shiny methods. I think it would be helpful for the DS community if the package took a clearer stance.

I would suggest at least a very visible warning in the doc, like a red banner ('There is some discussion about the usefulness of these methods. See: XXX. Use with caution').

This could be expanded with a UserWarning... maybe a bit brutal, but it could prevent a lot of trouble.

Edit: not sure why it added the good first issue automatically... but I'll take it.


glemaitre · 2024-10-14T11:59:56Z

Basically, we are also working on this topic in scikit-learn. As a milestone, we want to have an example that shows the effect of sample-weight and class-weight in scikit-learn, and then I would like to revamp the documentation of imbalanced-learn.
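
For readers unfamiliar with the two options mentioned here, a minimal sketch follows. It is not the planned scikit-learn example, and the dataset and classifier are assumptions chosen for illustration. It fits the same logistic regression once with `class_weight="balanced"` and once with per-sample weights from `compute_sample_weight`, which should lead to essentially the same coefficients.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_sample_weight

# Illustrative imbalanced dataset (about 10% positives).
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

# Option 1: let the estimator reweight the classes.
by_class = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# Option 2: pass explicit per-sample weights derived from the class frequencies.
weights = compute_sample_weight("balanced", y)
by_sample = LogisticRegression(max_iter=1000).fit(X, y, sample_weight=weights)

# Largest absolute difference between the two coefficient vectors.
coef_gap = float(np.abs(by_class.coef_ - by_sample.coef_).max())
print("max coefficient difference:", coef_gap)
```

Both options reweight the loss without creating synthetic samples, which is the alternative to resampling referred to in this comment.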

lcrmorin · 2024-10-18T08:51:27Z

Thanks for the answer. Implementation and documentation within sklearn seem to be the way to go in the long run. Maybe in the short term this ongoing work should be documented a bit more visibly... a lot of newcomers are still pushing SMOTE and the like.

jamblejoe · 2024-12-20T14:26:27Z

> Basically, we are also working on this topic in scikit-learn. As a milestone, we want to have an example that shows the effect of sample-weight and class-weight in scikit-learn, and then I would like to revamp the documentation of imbalanced-learn.

Is there a linked PR or issue in scikit-learn? I am one of the "newcomers" and just found out about this package via Stack Exchange.

lcrmorin added the "good first issue" label on Oct 12, 2024

glemaitre removed the "good first issue" label on Oct 14, 2024
