
Unable to create a small sample of 1000 train and 100 using MultilabelStratifiedShuffleSplit #15

meltedhead opened this issue Mar 31, 2021 · 3 comments

meltedhead commented Mar 31, 2021

Hi trent-b:

Thanks for this repository; I hope you can help with my issue. I have a large JSON dataset and want to use MultilabelStratifiedShuffleSplit to create a smaller sample set.

import warnings

import numpy as np
from iterstrat.ml_stratifiers import MultilabelStratifiedShuffleSplit

def mlb_train_test_split(labels, test_size, train_size, random_state=0):
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", category=FutureWarning)
        msss = MultilabelStratifiedShuffleSplit(
            test_size=test_size, train_size=train_size, random_state=random_state
        )
    # Only the first split is needed, so take it straight from the generator.
    train_idx, test_idx = next(msss.split(np.ones_like(labels), labels))
    return train_idx, test_idx

I then call the function as:

train_idx, test_idx = mlb_train_test_split(labels, test_size=1000, train_size=200, random_state=0)

When I look at the numbers, I'm seeing far more than 200 rows in the training split. Is there a limitation? The labels array is approximately 500,000 rows long.

trent-b (Owner) commented Apr 1, 2021

meltedhead,

Thank you for catching this bug. I do not think I ever tested with train_size set to a value other than None. As a workaround, you could do the following:

# First carve out a stratified subset of 1200 rows, then split that subset
# into 1000 test rows and ~200 train rows.
_, test_idx = mlb_train_test_split(labels, test_size=1200, train_size=None, random_state=0)
subset_labels = labels[test_idx].copy()
train_idx, test_idx = mlb_train_test_split(subset_labels, test_size=1000, train_size=None, random_state=1)
print('Num train labels:', len(subset_labels[train_idx]), '; proportions:', np.mean(subset_labels[train_idx], axis=0))
print('Num test labels:', len(subset_labels[test_idx]), '; proportions:', np.mean(subset_labels[test_idx], axis=0))
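
For what it's worth, this two-step workaround can be wrapped in a small helper so that the returned indices refer to the original labels array rather than to the subset. The sketch below is hypothetical (the name mlb_sample_split is not part of the library) and assumes the mlb_train_test_split function defined above:

def mlb_sample_split(labels, n_train, n_test, random_state=0):
    # Hypothetical helper wrapping the two-step workaround above.
    # Step 1: carve out a stratified subset of roughly n_train + n_test rows.
    _, subset_idx = mlb_train_test_split(
        labels, test_size=n_train + n_test, train_size=None, random_state=random_state
    )
    subset_labels = labels[subset_idx]
    # Step 2: split the subset; its "test" fold becomes the final test set and
    # the remainder becomes the final train set. Sizes may drift slightly
    # because stratification is preferred over exactly sized folds.
    train_rel, test_rel = mlb_train_test_split(
        subset_labels, test_size=n_test, train_size=None, random_state=random_state + 1
    )
    # Map the subset-relative indices back to positions in the original array.
    return subset_idx[train_rel], subset_idx[test_rel]

train_idx, test_idx = mlb_sample_split(labels, n_train=200, n_test=1000, random_state=0)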


nlassaux commented Apr 12, 2023

Hi there,

I don't know if it helps, but I can see the same behavior in this case with only test_size set:

from iterstrat.ml_stratifiers import MultilabelStratifiedShuffleSplit
import numpy as np

y = np.random.randint(2, size=(600, 40))
X = np.random.randint(2, size=(600, 5))

expected_test_size = 64
mskf = MultilabelStratifiedShuffleSplit(n_splits=10, test_size=expected_test_size)

for train_index, test_index in mskf.split(X, y):
    print("TRAIN:", len(train_index), "TEST:", len(test_index))

The above prints:

TRAIN: 529 TEST: 71
TRAIN: 533 TEST: 67
TRAIN: 531 TEST: 69
TRAIN: 532 TEST: 68
TRAIN: 532 TEST: 68
TRAIN: 530 TEST: 70
TRAIN: 532 TEST: 68
TRAIN: 532 TEST: 68
TRAIN: 533 TEST: 67
TRAIN: 533 TEST: 67

but I expected:

TRAIN: 536 TEST: 64
TRAIN: 536 TEST: 64
TRAIN: 536 TEST: 64
TRAIN: 536 TEST: 64
TRAIN: 536 TEST: 64
TRAIN: 536 TEST: 64
TRAIN: 536 TEST: 64
TRAIN: 536 TEST: 64
TRAIN: 536 TEST: 64
TRAIN: 536 TEST: 64

@nlassaux

Ah, I just read this in the docs of MultilabelStratifiedShuffleSplit:

Train and test sizes may be slightly different from desired due to the
preference of stratification over perfectly sized folds.

Given that the above case should be very well distributed, I wonder whether a split with exactly the requested test size is really that uncommon.
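
If an exact test size is needed and a small loss of stratification quality is acceptable, one possible post-processing step (not something the library provides, just a sketch) is to move the surplus or shortfall between the folds after the split:

import numpy as np
from iterstrat.ml_stratifiers import MultilabelStratifiedShuffleSplit

y = np.random.randint(2, size=(600, 40))
X = np.random.randint(2, size=(600, 5))

expected_test_size = 64
msss = MultilabelStratifiedShuffleSplit(n_splits=10, test_size=expected_test_size)

for train_index, test_index in msss.split(X, y):
    if len(test_index) > expected_test_size:
        # Move the surplus test indices back into the train fold.
        surplus = test_index[expected_test_size:]
        train_index = np.concatenate([train_index, surplus])
        test_index = test_index[:expected_test_size]
    elif len(test_index) < expected_test_size:
        # Borrow the shortfall from the train fold.
        deficit = expected_test_size - len(test_index)
        test_index = np.concatenate([test_index, train_index[:deficit]])
        train_index = train_index[deficit:]
    print("TRAIN:", len(train_index), "TEST:", len(test_index))  # always 536 / 64

The moved indices are no longer chosen with stratification in mind, so this trades a little label balance for an exact fold size.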
