For Assignment 11, the .ipynb scales the data before the train-test split (screenshot below). However, this is incorrect; scaling and centering should be done after splitting, and only on the training set (scaler.fit_transform(X_train)). The parameters derived from the training set should then be applied to the test set (scaler.transform(X_test)) to prevent data leakage and biasing the model. The test set should be treated as completely new, unseen data, or else the model's evaluation is no longer a measure of generalization.
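A minimal sketch of the corrected order, assuming scikit-learn's StandardScaler and a toy dataset standing in for the assignment's data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data standing in for the assignment's dataset (hypothetical).
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0, 1] * 5)

# Split FIRST, so the test set never influences the scaler's statistics.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit only on training data
X_test_scaled = scaler.transform(X_test)        # reuse the train-set mean/std

# Training columns are exactly centered; test columns generally are not,
# which is the point -- the test set is scaled with parameters it never saw.
print(np.allclose(X_train_scaled.mean(axis=0), 0.0))  # → True
```

Fitting the scaler on the full dataset before splitting leaks the test set's mean and variance into preprocessing, which is exactly the issue raised above.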
Also wanted to bring up a super minor nitpick for variable conventions. I believe ML and linear algebra typically keep X uppercase and y lowercase, since X is a matrix, while y is (often) a vector.
Thank you for the fun semester so far,
Jerry
Jerrryyy changed the title from "Assignment 11 - Incorrect ML Pre-Processing Procedure" to "Assignment 11 - Incorrect ML Preprocessing Procedure" on Nov 27, 2023.
Yeah, that's correct. For anyone else interested, you can read about this data leakage phenomenon in the scikit-learn documentation itself.
I think we kept it simple for the sake of the "intro" aspect of the assignment. Learning what ML does, and the concept of training and testing sets, is complicated enough for students who touched Python for the first time eight weeks ago.