class: middle, center, title-slide
Lecture 2: Trees and ensembles
- non-linear problems?
- redundant features?
- categorical features?
- different scales per feature?
Today is about Random forests and gradient boosted trees.
class: middle, center
(Chapter 3)
What should we do today?
Put a new sample through our tree. When we arrive at a leaf, predict that the sample belongs to the majority class of that leaf.
Trees are very fast to execute at prediction/inference time.
The goal is to create leaves that are pure, so that we can use the majority class as the prediction for new samples.
- finding the optimal partitioning of the input space is (in general) not feasible
- grow the tree in a greedy (one step at a time) fashion!
Algorithm:

1. Pick a node and check if it is pure:
  - if yes, mark it as a leaf
  - else: find the split point $S$ for the feature $j$ that leads to the largest decrease in impurity, and split the node into two child nodes according to this split
2. Go to 1.
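This greedy growing is exactly what scikit-learn does for you; a minimal sketch of fitting and using such a tree (the `make_moons` toy data is my choice, not from the slides):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# toy two-class dataset, purely illustrative
X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# the tree is grown greedily, split by split, as described above
tree = DecisionTreeClassifier()
tree.fit(X_train, y_train)

# a new sample is routed to a leaf and gets that leaf's majority class
print(tree.predict(X_test[:5]))
print(tree.score(X_test, y_test))
```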
"How full is the train?" vs "What is the weather like?"
Trying to predict if a train will be .redbg[delayed] or .bluebg[on time].
For each node, find the split point that leads to the largest decrease in impurity, where

- $I_T$: impurity of the current node
- $I_L$: impurity of the left child node
- $I_R$: impurity of the right child node
Two common measures for classification tasks:

- cross-entropy: $- \sum_{k} p_{mk} \log(p_{mk})$
- Gini index: $\sum_k p_{mk}(1-p_{mk})$
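As an illustration (not from the slides), a small NumPy sketch of both impurity measures, plus the weighted impurity decrease they feed into; the function names and the child-size weighting convention are my assumptions:

```python
import numpy as np

def cross_entropy(p):
    """Cross-entropy impurity: -sum_k p_k * log(p_k) for class proportions p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # treat 0 * log(0) as 0
    return -np.sum(p * np.log(p))

def gini(p):
    """Gini impurity: sum_k p_k * (1 - p_k) for class proportions p."""
    p = np.asarray(p, dtype=float)
    return np.sum(p * (1 - p))

def impurity_decrease(I_T, I_L, I_R, n_L, n_R):
    # assumed convention: children weighted by their share of the samples
    n = n_L + n_R
    return I_T - (n_L / n) * I_L - (n_R / n) * I_R

print(gini([0.5, 0.5]), cross_entropy([0.5, 0.5]))   # most impure node
print(gini([1.0, 0.0]), cross_entropy([1.0, 0.0]))   # pure node
```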
Growing a fully developed tree generally does not lead to good generalisation.
Either limit tree growth or prune tree after it has been grown.
Implemented in scikit-learn:

- `max_depth`
- `max_leaf_nodes`
- `min_samples_split`
- `min_impurity_decrease`
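For example (a sketch reusing the toy `X_train`/`X_test` from the earlier snippet; the parameter values are arbitrary):

```python
from sklearn.tree import DecisionTreeClassifier

# unrestricted tree: grows until every leaf is pure, tends to overfit
full_tree = DecisionTreeClassifier().fit(X_train, y_train)

# restricted tree: stop growing early
small_tree = DecisionTreeClassifier(
    max_depth=4,
    min_samples_split=10,
    min_impurity_decrease=0.001,
).fit(X_train, y_train)

print(full_tree.score(X_test, y_test), small_tree.score(X_test, y_test))
```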
Prediction: the mean $\bar{y}_m$ of the target values of the training samples in leaf $m$.
Impurity measures:

- mean squared error:
$$\frac{1}{N_m} \sum_{i \in N_m} (y_i - \bar{y}_m)^2$$
- mean absolute error:
$$\frac{1}{N_m} \sum_{i \in N_m} |y_i - \bar{y}_m|$$
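A hedged sketch for regression trees (the toy data and parameter values are my choice; the criterion names below are those of recent scikit-learn versions):

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# toy regression data, purely illustrative
X_reg, y_reg = make_regression(n_samples=300, n_features=5, noise=10,
                               random_state=0)

# `criterion` selects one of the impurity measures above
mse_tree = DecisionTreeRegressor(criterion="squared_error", max_depth=5)
mae_tree = DecisionTreeRegressor(criterion="absolute_error", max_depth=5)
mse_tree.fit(X_reg, y_reg)
mae_tree.fit(X_reg, y_reg)

# the prediction of a leaf is a summary of its training targets
# (the mean for squared error, the median for absolute error)
print(mse_tree.predict(X_reg[:3]))
```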
Two trees fitted on two random subsets of the same data
class: middle, center
show how accuracy plateaus after a while
class: middle, center
(Chapter 4)
A crowd of non-experts can give you a very good estimate if you ask each of them individually.
Combine several uncorrelated models together:
Combine predictions from logistic regression and a decision tree.
- Accuracy for LogisticRegression: 0.84
- Accuracy for DecisionTree: 0.80
- Accuracy for combination: 0.88
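The exact code behind these numbers is not shown here, but a sketch of such a combination, averaging predicted probabilities ("soft" voting) on the usual `X_train`/`X_test` split, could look like:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

logreg = LogisticRegression().fit(X_train, y_train)
tree = DecisionTreeClassifier(max_depth=5).fit(X_train, y_train)

# average the two models' class probabilities and pick the largest
proba = (logreg.predict_proba(X_test) + tree.predict_proba(X_test)) / 2
combined = proba.argmax(axis=1)              # assumes classes 0..k-1

print("Accuracy for combination:", np.mean(combined == y_test))
```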
How could we build models that are uncorrelated with each other?
- only show a subset of the data to each model
- only consider a subset of the features at each split
Bootstrap (sample with replacement):
Select subset of features:
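A small NumPy sketch of both sources of randomness (array names, shapes and the `sqrt` choice are assumptions):

```python
import numpy as np

rng = np.random.RandomState(0)
n_samples, n_features = X_train.shape

# bootstrap: draw n_samples indices *with* replacement
bootstrap_idx = rng.randint(0, n_samples, size=n_samples)
X_boot, y_boot = X_train[bootstrap_idx], y_train[bootstrap_idx]

# at each split, only a random subset of the features is considered
max_features = int(np.sqrt(n_features))
feature_subset = rng.choice(n_features, size=max_features, replace=False)
print("features considered for this split:", feature_subset)
```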
Many trees, decorrelate via bootstrap and sampling features at each split.
Feature importance: weighted mean decrease of impurity.
.footnote[G. Louppe, Understanding Random Forests, https://github.com/glouppe/phd-thesis ]
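In scikit-learn these importances are available on a fitted forest as `feature_importances_`; a sketch (data names assumed):

```python
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# one value per feature, normalised to sum to 1
for i, importance in enumerate(forest.feature_importances_):
    print(f"feature {i}: {importance:.3f}")
```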
- Main parameter: `max_features`
  - around `sqrt(n_features)` for classification
  - around `n_features` for regression
- `n_estimators`: around 100 or more
- Restricting tree growth might help, and definitely helps with model size!
  - `max_depth`, `max_leaf_nodes`, `min_samples_split`
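Put together, a typical configuration might look like this (the concrete values are illustrative, not a recommendation from the slides):

```python
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# classification: consider about sqrt(n_features) features per split
clf = RandomForestClassifier(
    n_estimators=200,       # > 100
    max_features="sqrt",
    max_leaf_nodes=64,      # restrict growth, keeps the model small
    n_jobs=-1,
)

# regression: consider (close to) all features per split
reg = RandomForestRegressor(n_estimators=200, max_features=1.0, n_jobs=-1)
```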
class: middle, center
(Chapter 4)
At each step you try to fix the mistakes made by the previous model: the new model is fitted to the residuals of the previous step.
In practice you want to make small adjustments as you go along: each model's contribution is multiplied by the learning rate.
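A minimal sketch of this idea for a regression problem (NumPy arrays `X_train`, `y_train` assumed; tree depth, number of rounds and learning rate are arbitrary):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

learning_rate = 0.1
prediction = np.zeros(len(y_train))
trees = []

for _ in range(100):
    # fit the next tree to the residuals of the current ensemble
    residuals = y_train - prediction
    tree = DecisionTreeRegressor(max_depth=3).fit(X_train, residuals)
    # each tree's contribution is scaled down by the learning rate
    prediction += learning_rate * tree.predict(X_train)
    trees.append(tree)

def boosted_predict(X):
    return sum(learning_rate * t.predict(X) for t in trees)
```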
class: middle, center
???
Show that you can overfit if you add more and more trees, unlike with random forests.
Pointer to CatBoost and dynamic boost, which claim not to overfit.
Weighted "mean decrease of impurity".
.footnote[G. Louppe, Understanding Random Forests, https://github.com/glouppe/phd-thesis ]
- `max_features` tends to be small
- a smaller learning rate requires more trees
- tune with ~1000 trees (or as many as you have patience for)

Once you have good parameters, increase `n_estimators` and decrease the learning rate.
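A hedged sketch of that recipe with scikit-learn's `GradientBoostingClassifier` (all parameter values are illustrative):

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# tune the other parameters with a fixed, fairly large number of trees
param_grid = {
    "max_depth": [2, 3, 5],
    "max_features": [0.1, 0.3, 1.0],
}
search = GridSearchCV(
    GradientBoostingClassifier(n_estimators=1000, learning_rate=0.1),
    param_grid,
)
search.fit(X_train, y_train)

# then trade a smaller learning rate for more trees
final = GradientBoostingClassifier(
    n_estimators=5000, learning_rate=0.02, **search.best_params_,
).fit(X_train, y_train)
```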
Fully scikit-learn compatible, but faster!
```python
from xgboost import XGBClassifier

xgb = XGBClassifier()
xgb.fit(X_train, y_train)
xgb.score(X_test, y_test)
```
Install it with `conda install -c conda-forge xgboost`.
Used by a lot of people who are "serious" about gradient boosted trees.
LightGBM is an open-source gradient boosting framework developed by Microsoft. It supports parallel and GPU learning.
```python
import lightgbm as lgb
from sklearn.model_selection import GridSearchCV

estimator = lgb.LGBMRegressor(num_leaves=31)

param_grid = {
    'learning_rate': [0.01, 0.1, 1],
    'n_estimators': [20, 40],
}

gbm = GridSearchCV(estimator, param_grid)
gbm.fit(X_train, y_train)

print('Best parameters found by grid search are:',
      gbm.best_params_)
```
Set of benchmarks comparing xgboost and lightgbm.
Looks a bit tricky to install :-/
If you want to see the idea of boosting in action in a different context, check out the Viola-Jones object detection algorithm.
Why limit yourself to combining trees?
Stacking, or combining different types of models.
```python
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

voting = VotingClassifier(
    [('logreg', LogisticRegression(C=100)),
     ('tree', DecisionTreeClassifier(max_depth=5)),
     ('knn', KNeighborsClassifier(n_neighbors=3))],
    voting='soft', flatten_transform=True)
voting.fit(X_train, y_train)
```
Can't we learn the weights?
```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

# `voting` is our original voting classifier;
# calling `transform()` on it produces class probabilities
stacking = make_pipeline(
    voting,
    # only keep the probabilities for one class per estimator
    FunctionTransformer(lambda X: X[:, 1::2]),
    # fit a logistic regression model on top
    LogisticRegression())
stacking.fit(X_train, y_train)
print(stacking.score(X_train, y_train))
# -> 0.92
print(stacking.score(X_test, y_test))
# -> 0.85
```
What is the problem now?
The first-stage models have already seen all of the training data, so their predictions on it are over-optimistic. Instead, fit the original models on a subset of the data and predict on the rest; this way you get unbiased predictions for all of the data.
.blackbg.black[pre] = predict and .whitebg.white[fit] = fit.
```python
from sklearn.model_selection import cross_val_predict

first_stage = make_pipeline(
    voting,
    FunctionTransformer(lambda X: X[:, 1::2]))

# `transform_cv` will contain unbiased predictions for each sample
transform_cv = cross_val_predict(first_stage, X_train, y_train,
                                 cv=5, method="transform")

second_stage = LogisticRegression().fit(transform_cv, y_train)
print(second_stage.coef_)
print(second_stage.score(transform_cv, y_train))
# -> 0.82
print(second_stage.score(first_stage.transform(X_test), y_test))
# -> 0.85
```
- non-linear problems? No problem!
- redundant features? No problem!
- categorical features? No problem!
- different scales per feature? No problem!
Random forests should be part of your baseline: there is essentially no reason not to use them, even though they might not produce the absolute best solution.
Gradient boosted trees are the go-to solution for most real-world problems, but they need some careful tuning.
Reading:
- Random forests: chapter 15 of "Elements of Statistical Learning"
- Boosted trees: chapter 10.9 of "Elements of Statistical Learning"