
Unable to create a small sample of 1000 train and 100 using MultilabelStratifiedShuffleSplit #15

meltedhead opened this issue Mar 31, 2021 · 3 comments

meltedhead commented Mar 31, 2021

Hi trent-b:

Thanks for this repository; I hope you can help with my issue. I have a large JSON dataset and want to use MultilabelStratifiedShuffleSplit to create a smaller sample set.

import warnings

import numpy as np
from iterstrat.ml_stratifiers import MultilabelStratifiedShuffleSplit

def mlb_train_test_split(labels, test_size, train_size, random_state=0):
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", category=FutureWarning)
        msss = MultilabelStratifiedShuffleSplit(
            test_size=test_size, train_size=train_size, random_state=random_state
        )
    # Only the first split is needed, so take it straight from the generator.
    train_idx, test_idx = next(msss.split(np.ones_like(labels), labels))
    return train_idx, test_idx

I then call the function as:

train_idx, test_idx = mlb_train_test_split(labels, test_size=1000, train_size=200, random_state=0)

When I look at the numbers, I'm seeing far more than 200 rows in the training split. Is there a limitation? The labels array is approximately 500,000 rows long.

trent-b (Owner) commented Apr 1, 2021

meltedhead,

Thank you for catching this bug. I do not think I ever tested with train_size set to a value other than None. As a workaround, you could do the following:

# First carve out a stratified subset of 1200 rows, then split that subset
# into 1000 test rows and ~200 train rows.
_, test_idx = mlb_train_test_split(labels, test_size=1200, train_size=None, random_state=0)
subset_labels = labels[test_idx].copy()
train_idx, test_idx = mlb_train_test_split(subset_labels, test_size=1000, train_size=None, random_state=1)
print('Num train labels:', len(subset_labels[train_idx]), '; proportions:', np.mean(subset_labels[train_idx], axis=0))
print('Num test labels:', len(subset_labels[test_idx]), '; proportions:', np.mean(subset_labels[test_idx], axis=0))
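
For what it's worth, this two-step workaround can be wrapped in a small helper so that the returned indices refer to the original labels array rather than to the subset. The sketch below is hypothetical (the name mlb_sample_split is not part of the library) and assumes the mlb_train_test_split function defined above:

def mlb_sample_split(labels, n_train, n_test, random_state=0):
    # Hypothetical helper wrapping the two-step workaround above.
    # Step 1: carve out a stratified subset of roughly n_train + n_test rows.
    _, subset_idx = mlb_train_test_split(
        labels, test_size=n_train + n_test, train_size=None, random_state=random_state
    )
    subset_labels = labels[subset_idx]
    # Step 2: split the subset; its "test" fold becomes the final test set and
    # the remainder becomes the final train set. Sizes may drift slightly
    # because stratification is preferred over exactly sized folds.
    train_rel, test_rel = mlb_train_test_split(
        subset_labels, test_size=n_test, train_size=None, random_state=random_state + 1
    )
    # Map the subset-relative indices back to positions in the original array.
    return subset_idx[train_rel], subset_idx[test_rel]

train_idx, test_idx = mlb_sample_split(labels, n_train=200, n_test=1000, random_state=0)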


nlassaux commented Apr 12, 2023

Hi there,

I don't know if it helps, but I can see the same behavior in this case with only test_size set:

from iterstrat.ml_stratifiers import MultilabelStratifiedShuffleSplit
import numpy as np

y = np.random.randint(2, size=(600, 40))
X = np.random.randint(2, size=(600, 5))

expected_test_size = 64
mskf = MultilabelStratifiedShuffleSplit(n_splits=10, test_size=expected_test_size)

for train_index, test_index in mskf.split(X, y):
    print("TRAIN:", len(train_index), "TEST:", len(test_index))

The above prints:

TRAIN: 529 TEST: 71
TRAIN: 533 TEST: 67
TRAIN: 531 TEST: 69
TRAIN: 532 TEST: 68
TRAIN: 532 TEST: 68
TRAIN: 530 TEST: 70
TRAIN: 532 TEST: 68
TRAIN: 532 TEST: 68
TRAIN: 533 TEST: 67
TRAIN: 533 TEST: 67

but I expected:

TRAIN: 536 TEST: 64
TRAIN: 536 TEST: 64
TRAIN: 536 TEST: 64
TRAIN: 536 TEST: 64
TRAIN: 536 TEST: 64
TRAIN: 536 TEST: 64
TRAIN: 536 TEST: 64
TRAIN: 536 TEST: 64
TRAIN: 536 TEST: 64
TRAIN: 536 TEST: 64

@nlassaux

Ah, I just read this in the docs of MultilabelStratifiedShuffleSplit:

Train and test sizes may be slightly different from desired due to the
preference of stratification over perfectly sized folds.

Given that the above case should be very well distributed, I wonder whether a split with exactly the requested test size is really that uncommon.
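
If an exact test size is needed and a small loss of stratification quality is acceptable, one possible post-processing step (not something the library provides, just a sketch) is to move the surplus or shortfall between the folds after the split:

import numpy as np
from iterstrat.ml_stratifiers import MultilabelStratifiedShuffleSplit

y = np.random.randint(2, size=(600, 40))
X = np.random.randint(2, size=(600, 5))

expected_test_size = 64
msss = MultilabelStratifiedShuffleSplit(n_splits=10, test_size=expected_test_size)

for train_index, test_index in msss.split(X, y):
    if len(test_index) > expected_test_size:
        # Move the surplus test indices back into the train fold.
        surplus = test_index[expected_test_size:]
        train_index = np.concatenate([train_index, surplus])
        test_index = test_index[:expected_test_size]
    elif len(test_index) < expected_test_size:
        # Borrow the shortfall from the train fold.
        deficit = expected_test_size - len(test_index)
        test_index = np.concatenate([test_index, train_index[:deficit]])
        train_index = train_index[deficit:]
    print("TRAIN:", len(train_index), "TEST:", len(test_index))  # always 536 / 64

The moved indices are no longer chosen with stratification in mind, so this trades a little label balance for an exact fold size.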
