Assignment 11 - Incorrect ML Preprocessing Procedure #748

Jerrryyy · 2023-11-27T09:02:37Z

Hi,

For Assignment 11, the .ipynb scales the data before the train-test split (screenshot below). However, this is incorrect: scaling and centering should be done after splitting, with the scaler fit only on the training set (scaler.fit_transform(X_train)). The parameters derived from the training set should then be applied to the test set (scaler.transform(X_test)) to prevent data leakage, which would bias the evaluation. The test set should be treated as completely new/unseen data; otherwise the test score no longer reflects how well the model generalizes.
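To make the order of operations concrete, here is a minimal sketch. It assumes scikit-learn's StandardScaler and uses synthetic data from NumPy as a stand-in, since the assignment's actual dataset and scaler are not shown here.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 100 samples, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y = rng.integers(0, 2, size=100)

# 1. Split first, so the test rows never influence preprocessing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 2. Fit the scaler on the training set only.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# 3. Reuse the training mean and standard deviation on the test set.
X_test_scaled = scaler.transform(X_test)
```

The scaled training features end up with zero mean and unit variance. The scaled test features generally will not, which is expected: they are transformed with statistics learned from the training set.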

Also wanted to bring up a super minor nitpick for variable conventions. I believe ML and linear algebra typically keep X uppercase and y lowercase, since X is a matrix, while y is (often) a vector.

Thank you for the fun semester so far,
Jerry


RayNele · 2023-11-27T17:52:02Z

Yeah, that's correct. For anyone else interested, you can read about this data leakage phenomenon in the scikit-learn documentation itself.
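As a rough sketch of one common way to avoid this kind of leakage, the scaler can be bundled with the model in a scikit-learn Pipeline, so that it is refit on the training portion of every split automatically. The dataset (make_classification) and the LogisticRegression model below are placeholders chosen for illustration, not what the assignment uses.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder data (assumption): a synthetic binary classification problem.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The pipeline fits the scaler only on whatever data fit() receives.
model = make_pipeline(StandardScaler(), LogisticRegression())

# During cross-validation the scaler is refit on each training fold,
# so the held-out fold never contributes to the scaling statistics.
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

# Final fit on the full training set, then one evaluation on the test set.
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
```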

I think we kept it simple for the sake of the "intro" aspect of the assignment. Learning what ML does, along with the concept of training and testing sets, is complicated enough for students who touched Python for the first time 8 weeks ago.

Jerrryyy changed the title ~~Assignment 11 - Incorrect ML Pre-Processing Procedure~~ Assignment 11 - Incorrect ML Preprocessing Procedure Nov 27, 2023
