[BUG] SMOTENC fails with ValueError: zero-size array to reduction operation maximum which has no identity #1035

Open
Ingvar-Y opened this issue Aug 17, 2023 · 0 comments


Describe the bug

SMOTENC's fit_resample fails with the NumPy error "ValueError: zero-size array to reduction operation maximum which has no identity" when execution reaches this line in SMOTENC._generate_samples:

is_max = np.isclose(col_maxs, col_maxs.max(axis=1, keepdims=True))

Steps/Code to Reproduce

The cause is unclear. It may be related to the dataset being highly imbalanced: the binary target equals 1 in only 134 of 22763 samples.
Example:

from imblearn.over_sampling import SMOTENC
from sklearn.preprocessing import OneHotEncoder

# labels, X and y come from the reporter's dataset (not shown here)
oversample = SMOTENC(
    categorical_features=labels[:10],
    categorical_encoder=OneHotEncoder(drop="if_binary", handle_unknown="ignore"),
    sampling_strategy="minority",
)
X_smotenc, y_smotenc = oversample.fit_resample(X, y)

Using SMOTE instead works without problem.

Expected Results

The dataset is oversampled without raising an error.

Actual Results

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[156], line 1
----> 1 trdata_smotenc, tgt_smotenc = oversample.fit_resample(ctrdata, trdata.DEF_FLG)

File ~\.conda\envs\test\Lib\site-packages\imblearn\base.py:208, in BaseSampler.fit_resample(self, X, y)
    187 """Resample the dataset.
    188 
    189 Parameters
   (...)
    205     The corresponding label of `X_resampled`.
    206 """
    207 self._validate_params()
--> 208 return super().fit_resample(X, y)

File ~\.conda\envs\test\Lib\site-packages\imblearn\base.py:112, in SamplerMixin.fit_resample(self, X, y)
    106 X, y, binarize_y = self._check_X_y(X, y)
    108 self.sampling_strategy_ = check_sampling_strategy(
    109     self.sampling_strategy, y, self._sampling_type
    110 )
--> 112 output = self._fit_resample(X, y)
    114 y_ = (
    115     label_binarize(output[1], classes=np.unique(y)) if binarize_y else output[1]
    116 )
    118 X_, y_ = arrays_transformer.transform(output[0], y_)

File ~\.conda\envs\test\Lib\site-packages\imblearn\over_sampling\_smote\base.py:683, in SMOTENC._fit_resample(self, X, y)
    680 X_ohe.data = np.ones_like(X_ohe.data, dtype=X_ohe.dtype) * self.median_std_ / 2
    681 X_encoded = sparse.hstack((X_continuous, X_ohe), format="csr")
--> 683 X_resampled, y_resampled = super()._fit_resample(X_encoded, y)
    685 # reverse the encoding of the categorical features
    686 X_res_cat = X_resampled[:, self.continuous_features_.size :]

File ~\.conda\envs\test\Lib\site-packages\imblearn\over_sampling\_smote\base.py:365, in SMOTE._fit_resample(self, X, y)
    363 self.nn_k_.fit(X_class)
    364 nns = self.nn_k_.kneighbors(X_class, return_distance=False)[:, 1:]
--> 365 X_new, y_new = self._make_samples(
    366     X_class, y.dtype, class_sample, X_class, nns, n_samples, 1.0
    367 )
    368 X_resampled.append(X_new)
    369 y_resampled.append(y_new)

File ~\.conda\envs\test\Lib\site-packages\imblearn\over_sampling\_smote\base.py:119, in BaseSMOTE._make_samples(self, X, y_dtype, y_type, nn_data, nn_num, n_samples, step_size)
    116 rows = np.floor_divide(samples_indices, nn_num.shape[1])
    117 cols = np.mod(samples_indices, nn_num.shape[1])
--> 119 X_new = self._generate_samples(X, nn_data, nn_num, rows, cols, steps)
    120 y_new = np.full(n_samples, fill_value=y_type, dtype=y_dtype)
    121 return X_new, y_new

File ~\.conda\envs\test\Lib\site-packages\imblearn\over_sampling\_smote\base.py:755, in SMOTENC._generate_samples(self, X, nn_data, nn_num, rows, cols, steps)
    753 col_maxs = all_neighbors[:, :, start_idx:end_idx].sum(axis=1)
    754 # tie breaking argmax
--> 755 is_max = np.isclose(col_maxs, col_maxs.max(axis=1, keepdims=True))
    756 max_idxs = rng.permutation(np.argwhere(is_max))
    757 xs, idx_sels = np.unique(max_idxs[:, 0], return_index=True)

File ~\.conda\envs\test\Lib\site-packages\numpy\core\_methods.py:41, in _amax(a, axis, out, keepdims, initial, where)
     39 def _amax(a, axis=None, out=None, keepdims=False,
     40           initial=_NoValue, where=True):
---> 41     return umr_maximum(a, axis, None, out, keepdims, initial, where)

ValueError: zero-size array to reduction operation maximum which has no identity
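
For readers unfamiliar with this NumPy message, here is a minimal sketch of how the two statements at lines 753 and 755 of the traceback can produce it. The array shape and the slice bounds are hypothetical; the report does not show how start_idx and end_idx are computed, so the out-of-range values below are an assumption chosen only to trigger the same error.

```python
import numpy as np

# Hypothetical shapes: 4 new samples, 5 neighbours each, 6 encoded columns.
all_neighbors = np.ones((4, 5, 6))

# Assumed slice bounds that start at the end of the encoded columns.
start_idx, end_idx = 6, 8

# Slicing past the end does not raise; it silently yields a zero-width block.
col_maxs = all_neighbors[:, :, start_idx:end_idx].sum(axis=1)
print(col_maxs.shape)

# Taking the maximum over the empty axis is what raises the ValueError.
try:
    col_maxs.max(axis=1, keepdims=True)
except ValueError as exc:
    error_message = str(exc)
    print(error_message)
```

The point of the sketch is that the failure surfaces at the max call, while the empty slice one line earlier passes silently, so the slice bounds are the more likely place to look.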

Versions

System:
    python: 3.11.4 | packaged by conda-forge | (main, Jun 10 2023, 17:59:51) [MSC v.1935 64 bit (AMD64)]
executable: C:\Users\user\.conda\envs\test\python.exe
   machine: Windows-10-10.0.19045-SP0

Python dependencies:
      sklearn: 1.3.0
          pip: 23.2.1
   setuptools: 68.0.0
        numpy: 1.25.2
        scipy: 1.11.1
       Cython: None
       pandas: 2.0.3
   matplotlib: 3.7.2
       joblib: 1.3.2
threadpoolctl: 3.2.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: mkl
    num_threads: 4
         prefix: libblas
       filepath: C:\Users\user\.conda\envs\test\Library\bin\libblas.dll
        version: 2022.1-Product
threading_layer: intel

       user_api: openmp
   internal_api: openmp
    num_threads: 8
         prefix: vcomp
       filepath: C:\Users\user\.conda\envs\test\vcomp140.dll
        version: None

       user_api: openmp
   internal_api: openmp
    num_threads: 8
         prefix: libiomp
       filepath: C:\Users\user\.conda\envs\test\Library\bin\libiomp5md.dll
        version: None