Assignment 11 - Incorrect ML Preprocessing Procedure #748

Open
Jerrryyy opened this issue Nov 27, 2023 · 1 comment
Comments

Jerrryyy commented Nov 27, 2023

Hi,

For Assignment 11, the .ipynb scales the data before the train-test split (screenshot below). However, this is incorrect: scaling and centering should be done after splitting, with the scaler fit only on the training set (scaler.fit_transform(X_train)). The parameters derived from the training set should then be applied to the test set (scaler.transform(X_test)) to prevent data leakage from biasing the evaluation. The test set should be treated as completely new/unseen data to the model; otherwise the test score no longer tells us how well the model generalizes.

image
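
For reference, the split-then-scale order described above could look roughly like the sketch below. It assumes scikit-learn's StandardScaler and train_test_split; the synthetic data, the 75/25 split, and the variable names are placeholders, not the assignment's actual notebook code.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical; not the assignment's dataset).
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y = (X[:, 0] > 50.0).astype(int)

# 1. Split first, so the test rows never influence preprocessing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 2. Fit the scaler on the training set only, then transform it.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# 3. Reuse the training-set mean and standard deviation on the test set.
X_test_scaled = scaler.transform(X_test)
```

After this, the scaled training columns have mean 0 and unit variance by construction, while the scaled test columns generally will not, which is expected because the test set did not contribute to the statistics.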

I also wanted to bring up a super minor nitpick about variable naming conventions. I believe ML and linear algebra typically keep X uppercase and y lowercase, since X is a matrix, while y is (often) a vector.

Thank you for the fun semester so far,
Jerry

Jerrryyy changed the title from "Assignment 11 - Incorrect ML Pre-Processing Procedure" to "Assignment 11 - Incorrect ML Preprocessing Procedure" on Nov 27, 2023

RayNele commented Nov 27, 2023

Yeah, that's correct. For anyone else interested, you can read about this data leakage phenomenon in the scikit-learn documentation itself.
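
One way to avoid this kind of leakage without tracking the fit/transform order by hand is to put the scaler and the model in a single pipeline. The sketch below is only an illustration under assumed choices: the synthetic dataset, LogisticRegression, and 5-fold cross-validation are stand-ins, not what the assignment uses.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical; not the assignment's dataset).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# The pipeline refits the scaler on the training portion of each fold,
# so held-out rows never contribute to the scaling statistics.
model = make_pipeline(StandardScaler(), LogisticRegression())

# One accuracy score per fold.
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())
```

Because the scaler lives inside the estimator, the same object can also be fit on X_train and scored on X_test without a separate scaling step.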

I think we kept it simple for the sake of the "intro" aspect of the assignment. Learning what ML does, plus the concept of training and testing sets, is complicated enough for students who touched Python for the first time 8 weeks ago.
