---
title: "4. Classification labs"
author: "Russ Conte"
date: "8/7/2021"
output: html_document
---
## 4.6 Lab: Logistic Regression, LDA, QDA, and KNN
## 4.6.1 The Stock Market Data set
From the text:
This data set consists of percentage returns for the S&P 500 stock index over 1,250 days, from the beginning of 2001 until the end of 2005. For each date, we have recorded the percentage returns for each of the five previous trading days, Lag1 through Lag5. We have also recorded Volume (the number of shares traded on the previous day, in billions), Today (the percentage return on the date in question) and Direction (whether the market was Up or Down on this date).
```{r the Stock Market data set}
library(ISLR)
attach(Smarket)
summary(Smarket)
```
## The cor() function
from the text:
The cor() function produces a matrix that contains all of the pairwise correlations among the predictors in a data set.
```{r all pairwise correlations}
cor(Smarket[, -9]) # exclude column 9 (Direction), which is qualitative
plot(Volume)
```
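The text notes that the only substantial pairwise correlation is between Year and Volume, reflecting that trading volume increased over time. As a quick check (not part of the original chunk), we can pull out that single correlation:
```{r correlation between Year and Volume}
# The one notable pairwise correlation: Volume grows with Year
cor(Year, Volume)
```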
## 4.6.2 Logistic Regression
```{r glm using logistic regression}
glm.fit = glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Smarket, family = binomial)
summary(glm.fit)
```
## find only the coefficients for the model
```{r}
coef(glm.fit)
```
## print a summary of the model
```{r summary of the glm model}
summary(glm.fit)
```
## How to use the predict function
```{r using the predict function}
glm.probs = predict(object = glm.fit, type = "response") # fitted probabilities on the training data
glm.probs[1:10]
glm.pred = rep("Down", 1250)
glm.pred[glm.probs > 0.5] = "Up" # predict "Up" whenever the fitted probability exceeds 0.5
table(glm.pred, Direction)
mean(glm.pred == Direction) # 0.5216, only slightly better than flipping a fair coin
```
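These probabilities are the probability that the market goes Up, because of how R dummy-codes the Direction factor. A quick way to confirm the coding (a small check, not part of the original chunk) is contrasts():
```{r checking how R codes Direction}
# contrasts() shows the dummy coding: Up = 1, Down = 0,
# so glm.probs > 0.5 corresponds to predicting "Up"
contrasts(Direction)
```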
from the text:
In order to better assess the accuracy of the logistic regression model in this setting, we can fit the model using part of the data, and then examine how well it predicts the held-out data.
```{r setting up training and testing data subsets}
train = Year < 2005                # logical vector: TRUE for observations before 2005
Smarket.2005 = Smarket[!train, ]   # held-out data: the 2005 observations
dim(Smarket.2005)
Direction.2005 = Direction[!train]
```
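As a quick sanity check (not part of the original lab code), we can count how many observations fall on each side of the split; the number of FALSE entries should match dim(Smarket.2005) above:
```{r sanity check on the train and test split}
sum(train)   # training days, 2001 through 2004
sum(!train)  # held-out days in 2005
```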
## Now fit a logistic regression model using only the observations before 2005
```{r}
glm.fits = glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Smarket, family = binomial, subset = train)
glm.probs = predict(glm.fits, Smarket.2005, type = "response")
glm.pred = rep("Down", 252)
glm.pred[glm.probs > 0.5] = "Up"
table(glm.pred, Direction.2005)
mean(glm.pred == Direction.2005) # 0.4801587 - worse than random guessing, ouch!!
```
from the text:
Perhaps by removing the variables that appear not to be helpful in predicting Direction, we can obtain a more effective model.
```{r only using the most helpful variables}
glm.fit = glm(Direction ~ Lag1 + Lag2, data = Smarket, family = binomial, subset = train)
glm.probs = predict(glm.fit, newdata = Smarket.2005, type = "response") # predict with the refit model, not the full one
glm.pred = rep("Down", 252)
glm.pred[glm.probs > 0.5] = "Up"
table(glm.pred, Direction.2005)
mean(glm.pred == Direction.2005) # 0.5595238 - about 56% of daily movements predicted correctly, a clear improvement
```
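To predict Direction on days with particular values of Lag1 and Lag2, we can pass a data frame of new values to predict(). A small sketch, using illustrative Lag values:
```{r predicting for specific values of Lag1 and Lag2}
# Predicted probability of the market going Up for two hypothetical days
predict(glm.fit,
        newdata = data.frame(Lag1 = c(1.2, 1.5), Lag2 = c(1.1, -0.8)),
        type = "response")
```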
## 4.6.3 Linear Discriminant Analysis
from the text:
Now we will perform LDA on the Smarket data. In R, we fit an LDA model using the lda() function, which is part of the MASS library.
```{r}
library(MASS)
lda.fit = lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
lda.fit
```
from the text:
The LDA output indicates that $\hat{\pi}_1 = 0.492$ and $\hat{\pi}_2 = 0.508$; in other words, 49.2% of the training observations correspond to days during which the market went down. It also provides the group means; these are the average of each predictor within each class, and are used by LDA as estimates of $\mu_k$.
The predict() function returns a list with three elements. The first element, class, contains LDA's predictions about the movement of the market. The second element, posterior, is a matrix whose kth column contains the posterior probability that the corresponding observation belongs to the kth class, computed from (4.10). Finally, x contains the linear discriminants, described earlier.
```{r LDA predict example}
lda.pred = predict(lda.fit, Smarket.2005)
lda.class = lda.pred$class
table(lda.class, Direction.2005)
mean(lda.class == Direction.2005) # 0.5595238 A nice improvement!
```
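The class predictions above come from applying a 50% threshold to the posterior probabilities. As a quick check (a small sketch, assuming column 1 of the posterior matrix corresponds to the Down class, which comes first alphabetically), counting observations on each side of the threshold reproduces the split seen in lda.class:
```{r LDA posterior probabilities and the 50 percent threshold}
# Days where the posterior probability of "Down" is at least 0.5 are classified Down;
# the rest are classified Up
sum(lda.pred$posterior[, 1] >= 0.5)
sum(lda.pred$posterior[, 1] < 0.5)
```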
## 4.6.4 Quadratic Discriminant Analysis
From the text:
We will now fit a QDA model to the Smarket data. QDA is implemented in R using the qda() function, which is also part of the MASS library. The syntax is identical to that of lda().
```{r}
qda.fit = qda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
qda.fit
```
From the text:
The output contains the group means. But it does not contain the coefficients of the linear discriminants, because the QDA classifier involves a quadratic, rather than a linear, function of the predictors. The predict() function works in exactly the same fashion as for LDA.
```{r}
qda.class = predict(qda.fit, Smarket.2005)$class
table(qda.class, Direction.2005)
mean(qda.class == Direction.2005) # 0.5992063
```
## 4.6.5 K-Nearest Neighbors
from the text:
The knn() function, from the class library, requires four inputs.
1. A matrix containing the predictors associated with the training data, labeled train.X below.
2. A matrix containing the predictors associated with the data for which we wish to make predictions, labeled test.X below.
3. A vector containing the class labels for the training observations, labeled train.Direction below.
4. A value for K, the number of nearest neighbors to be used by the classifier.
We use the cbind() function, short for column bind, to bind the Lag1 and Lag2 variables together into two matrices, one for the training set and the other for the test set.
```{r KNN}
library(class)
train.X = cbind(Lag1, Lag2)[train,]
test.X = cbind(Lag1, Lag2)[!train,]
train.Direction = Direction[train]
set.seed(1)
knn.pred = knn(train.X, test.X, train.Direction, k = 1)
table(knn.pred, Direction.2005)
(43+83)/(43+83+68+58) # overall accuracy from the confusion matrix: exactly 0.5 - no better than chance!
```
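K = 1 may be too flexible for this noisy data set. Following the text, we can repeat the fit with a larger K; a short sketch with K = 3:
```{r KNN on Smarket with K equal to 3}
# A smoother decision boundary with K = 3
knn.pred = knn(train.X, test.X, train.Direction, k = 3)
table(knn.pred, Direction.2005)
mean(knn.pred == Direction.2005) # a little better than K = 1, but still modest
```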
## Use KNN with Caravan Insurance Data
```{r}
dim(Caravan)
attach(Caravan)
summary(Purchase)
```
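summary(Purchase) shows the response is highly unbalanced: per the text, only about 6% of people purchase caravan insurance. As a quick check (not part of the original chunk), the purchase rate can be computed directly:
```{r base rate of Purchase}
# Fraction of customers who purchase caravan insurance (roughly 6%)
mean(Purchase == "Yes")
```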
From the text:
Because the KNN classifier predicts the class of a test observation using the training observations nearest to it, the scale of the variables matters: predictors measured on a large scale will dominate the distance calculation. A good way to handle this problem is to standardize the data so that all variables are given a mean of zero and a standard deviation of one. Then all variables will be on a comparable scale. The scale() function does just this. In standardizing the data, we exclude column 86, because that is the qualitative Purchase variable.
```{r Standardizing the data}
standardized.X = scale(Caravan[, -86])
var(Caravan[,1])
var(Caravan[,2])
var(standardized.X[,1])
var(standardized.X[,2])
# Now every column of standardized.X has a standard deviation of one and a mean of zero.
```
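As a quick verification of the standardization (not part of the original chunk), the first column should now have mean approximately zero and standard deviation one:
```{r verifying the standardization}
mean(standardized.X[, 1]) # approximately 0
sd(standardized.X[, 1])   # 1
```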
from the text:
We now split the observations into a test set, containing the first 1,000 observations, and a training set, containing the remaining observations. We fit a KNN model on the training data using K = 1, and evaluate its performance on the test data.
```{r}
test = 1:1000                    # the first 1,000 observations form the test set
train.X = standardized.X[-test, ]
test.X = standardized.X[test, ]
train.Y = Purchase[-test]
test.Y = Purchase[test]
set.seed(1)
knn.pred = knn(train.X, test.X, train.Y, k = 1)
mean(test.Y != knn.pred) # about 0.118 - the KNN test error rate
mean(test.Y != "No")     # 0.059 - only about 6% of test customers purchase, so always predicting "No" would beat KNN on raw error
```
From the text:
Using K = 3, the success rate increases to 19%, and with K = 5 the rate is 26.7%. This is over four times the rate that results from random guessing. It appears that KNN is finding some real patterns in a difficult data set!
```{r}
table(knn.pred, test.Y)
9 / (68 + 9) # with K = 1, about 11.7% of those predicted to buy actually do - much better than the 6% base rate
knn.pred = knn(train.X, test.X, train.Y, k = 3)
table(knn.pred, test.Y)
mean(test.Y[knn.pred == "Yes"] == "Yes") # with K = 3, about 19% - even better than before
knn.pred = knn(train.X, test.X, train.Y, k = 5)
table(knn.pred, test.Y)
mean(test.Y[knn.pred == "Yes"] == "Yes") # with K = 5, about 26.7% - very nice compared to before
```
## Use logistic regression to predict who will buy insurance
```{r}
glm.fits = glm(Purchase ~ ., data = Caravan, family = binomial, subset = -test)
glm.probs = predict(glm.fits, Caravan[test, ], type = "response")
glm.pred = rep("No", 1000)
glm.pred[glm.probs > 0.25] = "Yes" # use a 0.25 cutoff rather than 0.5, since so few customers purchase
table(glm.pred, test.Y)
11 / (11 + 22) # about 33% of those predicted to buy actually do - a very good result!
```
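For comparison, a sketch of what the standard 0.5 cutoff would do with the same fitted probabilities: very few customers are predicted to purchase at all, which is why the 0.25 cutoff above is more useful for such an unbalanced problem.
```{r logistic regression with the standard 0.5 cutoff}
# With the usual 0.5 cutoff almost no test customers are predicted to purchase
glm.pred = rep("No", 1000)
glm.pred[glm.probs > 0.5] = "Yes"
table(glm.pred, test.Y)
```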