-
Notifications
You must be signed in to change notification settings - Fork 23
/
README.rmd
97 lines (76 loc) · 3.6 KB
/
README.rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
---
title: "README"
author: "Eric Graves"
date: "October 12, 2016"
output: md_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, fig.width=5, fig.height=5, fig.align='center')
library(isofor)
```
## Isolation Forest
An Isolation Forest is an ensemble of completely random decision trees. At each split a random
feature and a random split point is chosen. Anomalies are isolated if they end up in a partition
far away from the rest of the data. In decision tree terms, this corresponds to a record that
has a short "path length". The path length is the number of nodes that a record passes through
before terminating in a leaf node. Records with short average path lengths through the entire
ensemble are considered anomalies.
## An analogy
Describing the location of a country home takes many fewer directions than describing the
location of a brownstone in Brooklyn. The country home might be described as "the only house
on the south shore of Lake Woebegon". While directions to the brownstone must be qualified
with much more detail: "Go north on 5th Street for 12 blocks, take a left on Van Buren, etc.."
Isolated | Dense
:-------------------------:|:-------------------------:
<img src="README_files/figure-markdown_strict/2_river-house.jpg" height="250"/> | <img src="README_files/figure-markdown_strict/Brooklyn-brownstones.jpg" height="250"/>
The country house in this example is a literal outlier. It is off by itself away from most other
homes. Similarly, records that can be described succinctly are also outliers.
## Example
Here we create two random, normal vectors and add some outliers. The majority of the data
points are centered around (0, 0) with a standard deviation of 1/2. 50 outliers are introduced
and are centered around (-1.5, 1.5) with a standard deviation of 1. This is to encourage some
co-mingling of outliers with the bulk of the data.
```{r dummy-data}
N = 1e3
x = c(rnorm(N, 0, 0.5), rnorm(N*0.05, -1.5, 1))
y = c(rnorm(N, 0, 0.5), rnorm(N*0.05, 1.5, 1))
ol = c(rep(0, N), rep(1, (0.05*N))) + 2
data = data.frame(x, y)
plot(data, pch=ol)
title("Dummy data with outliers")
```
The code below builds an Isolation Forest by passing in the dummy data, the number of
trees requested (100) and the number of records to subsample for each tree (32). The
records that exceed the 95% percentile of the anomaly score should flag the most anomalous
records. By coloring such records as red and plotting the results the effectiveness of
the Isolation Forest can be viewed.
```{r isofor}
mod = iForest(X = data, 100, 32)
p = predict(mod, data)
col = ifelse(p > quantile(p, 0.95), "red", "blue")
plot(x, y, col=col, pch=ol)
```
Knowing there are two populations, the Kmeans algorithm seems like a good fit for identifying
the two clusters. However, we can see that it picks cluster centers that do not do a good job
of separating the data.
```{r kmeans}
km = kmeans(data, 2)
plot(x, y, col=km$cluster+1, pch=ol)
```
## Node Membership
In addition to predicting an anomaly score, a node membership matrix can also be
predicted. There are two ways to produce a nod membership matrix. The first
returns a matrix of size `n_obs x n_trees` and contains the terminal node ID for
each observation:
```{r node-ids}
node_ids = predict(mod, data, nodes=TRUE)
head(node_ids[,1:10])
```
THe second uses the Matrix library to return a sparse matrix of 1s and 0s
indicating node membership. There are as many ones as there are `n_obs x n_trees`
but the dimension is much larger at `n_obs x n_terminal_nodes`:
```{r node-membership}
library(Matrix)
nodes = predict(mod, data, sparse=TRUE)
head(nodes[,1:10])
```