
class: middle, center, title-slide

Advanced Computing Techniques

Lecture 2: Trees and ensembles


Real world data

  • non-linear problems?
  • redundant features?
  • categorical features?
  • different scales per feature?

Today is about Random forests and gradient boosted trees.


class: middle, center

Decision trees

(Chapter 3)


A decision tree

What should we do today?

.width-100[]


Dataset

.center.width-80[]


One level of splits

.center.width-80[]


One level of splits

.center.width-80[]


Two levels of splits

.center.width-80[]


Two levels of splits

.center.width-80[]


Three levels of splits

.center.width-80[]


Three levels of splits

.center.width-80[]


Ten levels of splits

.center.width-80[]


Ten levels of splits

.center.width-80[]


Making predictions

.center.width-80[]

Put the new sample through our tree. When we arrive at a leaf, predict that the sample belongs to the leaf's majority class.

$P(R | X) = \frac{R}{B+R}$ and $P(B | X) = \frac{B}{B+R}$, where $R$ and $B$ are the numbers of red and blue training samples in the leaf.

Trees are very fast to execute at prediction/inference time.
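
A minimal sketch of this on a toy dataset (the data here is synthetic, not the red/blue points from the figures): `predict()` returns the leaf's majority class, `predict_proba()` the class proportions inside the leaf.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# toy stand-in for the red/blue dataset in the figures
X, y = make_classification(n_samples=500, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)

# predict() -> majority class of the leaf each sample lands in
# predict_proba() -> class proportions inside that leaf
print(tree.predict(X_test[:3]))
print(tree.predict_proba(X_test[:3]))
```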


How to grow a tree?

The goal is to create leaves that are pure, so we can use the majority class as the prediction for new samples.

  • finding the optimal partitioning of the input space is (in general) not feasible
  • grow the tree in a greedy (one split at a time) fashion!

Algorithm:

  1. Pick a node and check if it is pure
    • if yes, mark it as a leaf
    • else:
      1. find the split point $S$ and feature $j$ that lead to the largest decrease in impurity
      2. split the node into two child nodes according to this split
  2. go to 1. until all remaining nodes are leaves

Two possible questions to split on. Which one is better?

How full is the train? vs What is the weather like?

.center.width-100[]

Trying to predict if a train will be .redbg[delayed] or .bluebg[on time].


Node splitting

For each node, find the split point $S$ on feature $j$ that leads to the largest decrease in impurity.

$$\Delta = I_T - \frac{n_L}{N} I_L - \frac{n_R}{N} I_R$$

  • $I_T$: impurity of the current node
  • $I_L$, $I_R$: impurity of the left and right child nodes
  • $n_L$, $n_R$: number of samples in the left and right child nodes; $N = n_L + n_R$ is the number of samples in the current node

$I_L$ and $I_R$ depend on the feature $j$ and split point $S$ chosen to split samples in the current node.


Measure impurity of a node

Two common measures for classification tasks, where $p_{mk}$ is the proportion of samples of class $k$ in node $m$ (a small computational sketch follows below):

  • cross-entropy: $- \sum_{k} p_{mk} \log(p_{mk})$
  • Gini index: $\sum_k p_{mk}(1-p_{mk})$

.center.width-50[]
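
A minimal NumPy sketch of both measures, together with the impurity decrease $\Delta$ from the previous slide (the node and split below are made up for illustration):

```python
import numpy as np

def gini(y):
    # sum_k p_k * (1 - p_k) over the class proportions p_k in this node
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return np.sum(p * (1 - p))

def entropy(y):
    # - sum_k p_k * log(p_k)
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def impurity_decrease(y, y_left, y_right, impurity=gini):
    # Delta = I_T - n_L/N * I_L - n_R/N * I_R
    n, n_l, n_r = len(y), len(y_left), len(y_right)
    return impurity(y) - n_l / n * impurity(y_left) - n_r / n * impurity(y_right)

# hypothetical node with 6 red (1) and 4 blue (0) samples,
# split into a pure left child and a mixed right child
y = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])
print(impurity_decrease(y, y[:4], y[4:]))
```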


The tree stops growing when all nodes are pure

.center.width-80[]


Tree structure

Growing a fully developed tree generally does not lead to good generalisation.

Either limit tree growth or prune the tree after it has been grown.

Implemented in scikit-learn (see the sketch after this list):

  • max_depth
  • max_leaf_nodes
  • min_samples_split
  • min_impurity_decrease
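
A minimal sketch of how these parameters are passed to a scikit-learn tree (all values are illustrative, and X_train/y_train are assumed to exist):

```python
from sklearn.tree import DecisionTreeClassifier

# each parameter limits tree growth in a different way;
# the next slides show their effect one by one
tree = DecisionTreeClassifier(
    max_depth=3,                  # at most 3 levels of splits
    max_leaf_nodes=8,             # at most 8 leaves
    min_samples_split=40,         # only split nodes with >= 40 samples
    min_impurity_decrease=0.01,   # only split if impurity drops enough
)
tree.fit(X_train, y_train)
```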

No limit on tree growth

.center.width-80[]


max_depth = 3

.center.width-80[]


max_leaf_nodes = 8

.center.width-50[]


min_samples_split = 40

.center.width-60[]


Regression trees

Prediction:

$$ \bar{y}_m = \frac{1}{N_m} \sum_{i \in N_m} y_i $$

Impurity measures (see the sketch after this list):

  • mean squared error: $$\frac{1}{N_m} \sum_{i \in N_m} (y_i - \bar{y}_m)^2 $$
  • mean absolute error: $$\frac{1}{N_m} \sum_{i \in N_m} |y_i - \bar{y}_m|$$
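
A minimal sketch of a regression tree in scikit-learn (criterion names as in recent scikit-learn releases, which call them squared_error and absolute_error; X_train etc. are assumed to exist):

```python
from sklearn.tree import DecisionTreeRegressor

# mean squared error as the impurity; use criterion="absolute_error"
# for the mean absolute error variant
reg = DecisionTreeRegressor(criterion="squared_error", max_depth=3)
reg.fit(X_train, y_train)
# the prediction for a sample is the mean target value of its leaf
print(reg.predict(X_test[:3]))
```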

Trees have high variance

Two trees fitted on two random subsets of the same data

.pull-left.width-100[![](images/dt_unstable1.svg)] .pull-right.width-100[![](images/dt_unstable2.svg)]

class: middle, center

Interlude (decision trees, tune max_depth)

???

Show how accuracy plateaus after a while.


class: middle, center

Random Forests

(Chapter 4)


Wisdom of crowds

A crowd of non-experts can give you a very good estimate if you ask each person individually and combine their answers.

.center.width-80[]


Wisdom of classifier crowds

Combine several uncorrelated models:

.width-100[![](images/voting_lr.png)] .width-100[![](images/voting_dt.png)] .width-100[![](images/voting_combined.png)]

Combine the predictions from logistic regression and a decision tree (see the sketch below).

  • Accuracy for LogisticRegression: 0.84
  • Accuracy for DecisionTree: 0.80
  • Accuracy for combination: 0.88
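
A minimal sketch of this kind of combination by averaging predicted probabilities (X_train, y_train, X_test, y_test are assumed to exist; the numbers above come from the lecture's own dataset):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

logreg = LogisticRegression().fit(X_train, y_train)
tree = DecisionTreeClassifier(max_depth=5).fit(X_train, y_train)

# average the class probabilities of the two (hopefully uncorrelated) models
proba = (logreg.predict_proba(X_test) + tree.predict_proba(X_test)) / 2
y_pred = logreg.classes_[proba.argmax(axis=1)]
print("Accuracy for combination:", np.mean(y_pred == y_test))
```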

The key: uncorrelated models

How could we build models that are uncorrelated with each other?


The key: uncorrelated models

How could we build models that are uncorrelated with each other?

  • only show a subset of the data to each model
  • only consider a subset of the features at each split

Bootstrap (sample with replacement):

.center.width-90[]

Select subset of features:

.center.width-90[]
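
A minimal NumPy sketch of both ideas, assuming X_train and y_train are NumPy arrays:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = X_train.shape

# bootstrap: sample row indices with replacement
bootstrap_idx = rng.choice(n_samples, size=n_samples, replace=True)
X_boot, y_boot = X_train[bootstrap_idx], y_train[bootstrap_idx]

# at each split, consider only a random subset of the features
max_features = int(np.sqrt(n_features))
feature_idx = rng.choice(n_features, size=max_features, replace=False)
```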


Random forests

Many trees, decorrelated via the bootstrap and by sampling features at each split.

.center[

.width-90[![](images/rf_trees_0.png)] .width-90[![](images/rf_trees_1.png)] .width-90[![](images/rf_trees_2.png)]
]

.center[ .width-30[] ]

Feature importances

Weighted mean decrease of impurity.

.center.width-60[]

.footnote[G. Louppe, Understanding Random Forests, https://github.com/glouppe/phd-thesis ]


Tuning random forests

  • Main parameter: max_features
    • around sqrt(n_features) for classification
    • around n_features for regression
  • n_estimators > 100
  • Restricting tree growth might help accuracy, and it definitely helps with model size (see the sketch below)!
    • max_depth, max_leaf_nodes, min_samples_split
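
A minimal sketch of these settings (the values are illustrative, not tuned):

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=200,      # > 100 trees
    max_features="sqrt",   # ~sqrt(n_features) per split, for classification
    n_jobs=-1,             # trees are independent, so train them in parallel
    random_state=0,
)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))
# weighted mean decrease of impurity, as on the previous slide
print(rf.feature_importances_)
```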

class: middle, center

Gradient boosting

(Chapter 4)


Step by step example

$$f_1(x) \approx y$$

$$f_2(x) \approx y - f_1(x)$$

$$f_3(x) \approx y - f_1(x) - f_2(x)$$

At each step you try to fix the mistakes made by the models so far: each new model is fitted to the residuals of the previous step.


Regression example

.center.width-70[]


Stage 0

.center.width-70[]


Stage 0 - residuals

.center.width-70[]


Stage 1

.center.width-70[]


Stage 1 - residuals

.center.width-70[]


Stage 2

.center.width-70[]


Stage 2 - residuals

.center.width-70[]


Stage 3

.center.width-70[]


Full model

.center.width-70[]


Step by step example

$$f_1(x) \approx y$$

$$f_2(x) \approx y - \alpha f_1(x)$$

$$f_3(x) \approx y - \alpha f_1(x) - \alpha f_2(x)$$

$$f_4(x) \approx y - \alpha f_1(x) - \alpha f_2(x) - \alpha f_3(x)$$

$$ ... $$

At each step you try to fix the mistakes made by the models so far: each new model is fitted to the residuals of the previous step.

In practice you want to make small adjustments as you go along, so each model's contribution is multiplied by $\alpha$. This is also referred to as "shrinkage" or the "learning rate".
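
A minimal sketch of this loop for a regression problem with squared error, using shallow trees as the individual models (alpha and the number of stages are arbitrary choices; X_train and a numeric y_train are assumed to exist):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

alpha = 0.1       # shrinkage / learning rate
n_stages = 100
models = []

residual = y_train.astype(float)
for _ in range(n_stages):
    # fit the next model to the residuals of the current ensemble
    model = DecisionTreeRegressor(max_depth=3).fit(X_train, residual)
    models.append(model)
    residual -= alpha * model.predict(X_train)

def predict(X):
    # the ensemble prediction is the shrunken sum of all stages
    return alpha * np.sum([m.predict(X) for m in models], axis=0)
```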


class: middle, center

Interlude 2

???

Show that you can overfit if you add more and more trees, unlike with random forests.

Pointer to CatBoost and dynamic boosting, which claim to not overfit.


Feature importances

Weighted "mean decrease of impurity".

.center.width-60[]

.footnote[G. Louppe, Understanding Random Forests, https://github.com/glouppe/phd-thesis ]


Partial dependence plots

.center.width-70[]


Partial dependence plots

.center.width-100[]
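
A minimal sketch of how plots like these can be produced with scikit-learn's inspection module (recent versions; `model` and the feature indices are placeholders):

```python
import matplotlib.pyplot as plt
from sklearn.inspection import PartialDependenceDisplay

# marginal effect of two (hypothetical) features on the model's prediction
PartialDependenceDisplay.from_estimator(model, X_train, features=[0, 1])
plt.show()
```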


Tuning gradient boosting

  • max_features tends to be small
  • a smaller learning rate requires more trees
  • tune with ~1000 trees (or as many as you have patience for)

.center.width-80[]

Once you have good parameters, increase n_estimators and decrease the learning rate.
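
A minimal sketch of this recipe with scikit-learn's GradientBoostingClassifier (all values are illustrative):

```python
from sklearn.ensemble import GradientBoostingClassifier

# tune the other parameters with a fixed, generous number of trees ...
gbrt = GradientBoostingClassifier(n_estimators=1000,
                                  learning_rate=0.1,
                                  max_depth=3,
                                  max_features=0.1)
gbrt.fit(X_train, y_train)

# ... then, for the final model, increase n_estimators
# and decrease the learning rate
final = GradientBoostingClassifier(n_estimators=3000,
                                   learning_rate=0.03,
                                   max_depth=3,
                                   max_features=0.1)
final.fit(X_train, y_train)
```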


XGBoost

Fully scikit-learn compatible, but faster!

```python
from xgboost import XGBClassifier

xgb = XGBClassifier()
xgb.fit(X_train, y_train)
xgb.score(X_test, y_test)
```

Install it with conda install -c conda-forge xgboost.

Used by a lot of people who are "serious" about gradient boosted trees.


LightGBM

Gradient boosting framework developed by Microsoft, and it is open source! Supports parallel and GPU learning.

```python
import lightgbm as lgb
from sklearn.model_selection import GridSearchCV

estimator = lgb.LGBMRegressor(num_leaves=31)

param_grid = {
    'learning_rate': [0.01, 0.1, 1],
    'n_estimators': [20, 40]
}

gbm = GridSearchCV(estimator, param_grid)
gbm.fit(X_train, y_train)
print('Best parameters found by grid search are:',
      gbm.best_params_)
```

Set of benchmarks comparing xgboost and lightgbm.

Looks a bit tricky to install :-/


Viola Jones for face detection

If you want to see the idea of boosting in action in a different context, check out the Viola-Jones object detection algorithm.

.center.width-100[]


Stacking

Why limit yourself to combining trees?

Stacking, or combining different types of models.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

voting = VotingClassifier([('logreg',
                            LogisticRegression(C=100)),
                           ('tree',
                            DecisionTreeClassifier(max_depth=5)),
                           ('knn',
                            KNeighborsClassifier(n_neighbors=3))
                          ],
                         voting='soft', flatten_transform=True)
voting.fit(X_train, y_train)
```

Combine models by averaging

.width-100[![](images/voting_Logistic%20Regression.png)] .width-100[![](images/voting_KNN.png)] .width-100[![](images/voting_Decision Tree.png)]
.center.width-40[![](images/voting_Average.png)]

Can't we learn the weights?


Combine models via LogisticRegression

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

# `voting` is our original voting classifier;
# calling `transform()` on it produces class probabilities
stacking = make_pipeline(voting,
                         # only keep the probabilities for one class
                         FunctionTransformer(lambda X: X[:, 1::2]),
                         # fit a logistic regression model on top
                         LogisticRegression())
stacking.fit(X_train, y_train)
print(stacking.score(X_train, y_train))
# -> 0.92
print(stacking.score(X_test, y_test))
# -> 0.85
```

What is the problem now?


Need unbiased predictions

Fit the original models on a subset of the data, predict on the rest. This way you get unbiased predictions for all of the data.

.width-40[] .blackbg.black[pre] = predict and .whitebg.white[fit] = fit.

```python
from sklearn.model_selection import cross_val_predict

first_stage = make_pipeline(voting,
                            FunctionTransformer(
                                lambda X: X[:, 1::2])
                            )
transform_cv = cross_val_predict(first_stage, X_train, y_train,
                                 cv=10, method="transform")
```

Full stacking

```python
from sklearn.model_selection import cross_val_predict

first_stage = make_pipeline(voting,
                            FunctionTransformer(
                                lambda X: X[:, 1::2])
                            )
# `transform_cv` will contain unbiased predictions
# for each training sample
transform_cv = cross_val_predict(first_stage, X_train, y_train,
                                 cv=5, method="transform")

second_stage = LogisticRegression().fit(transform_cv, y_train)
print(second_stage.coef_)
print(second_stage.score(transform_cv, y_train))
# -> 0.82

# fit the first stage on the full training data before transforming
# the test set (cross_val_predict only fits clones of it)
first_stage.fit(X_train, y_train)
print(second_stage.score(first_stage.transform(X_test), y_test))
# -> 0.85
```

Summary

  • non-linear problems? No problem!
  • redundant features? No problem!
  • categorical features? No problem!
  • different scales per feature? No problem!

Random forests should be part of your baseline: there is essentially no reason not to use them, though they might not produce the absolute best solution.

Gradient boosted trees are the go-to solution for most real-world problems, but they need some careful tuning.

Reading:

  • Random forests: chapter 15 of "Elements of Statistical Learning"
  • Boosted trees: chapter 10.9 of "Elements of Statistical Learning"