---
output:
word_document: default
---
# Paula McCree-Bailey
## BAN 502 - Module 4 Assignment 2
### Classification Trees
```{r load library and dataset, message=FALSE}
#install.packages("rpart")
#install.packages("RColorBrewer")
#install.packages("rattle")
library(tidyverse)
library(caret) #for splitting functions
library(rpart) #for classification trees
library(RColorBrewer) #better visualization of classification trees
library(rattle) #better visualization of classification trees
parole = read.csv("parole.csv")
```
#### Factor Conversion
```{r factor conversion}
parole = parole %>% mutate(male = as_factor(as.numeric(male))) %>%
  mutate(male = fct_recode(male,
                           "Female" = "0",
                           "Male" = "1"))
parole = parole %>% mutate(race = as_factor(as.numeric(race))) %>%
  mutate(race = fct_recode(race,
                           "Otherwise" = "2",
                           "White" = "1"))
parole = parole %>% mutate(state = as_factor(as.numeric(state))) %>%
  mutate(state = fct_recode(state,
                           "Virginia" = "4",
                           "Louisiana" = "3",
                           "Kentucky" = "2",
                           "Other state" = "1"))
parole = parole %>% mutate(crime = as_factor(as.numeric(crime))) %>%
  mutate(crime = fct_recode(crime,
                           "driving-related crime" = "4",
                           "drug-related crime" = "3",
                           "larceny" = "2",
                           "other crime" = "1"))
parole = parole %>% mutate(multiple.offenses = as_factor(as.numeric(multiple.offenses))) %>%
  mutate(multiple.offenses = fct_recode(multiple.offenses,
                           "Otherwise" = "0",
                           "multiple offenses" = "1"))
parole = parole %>% mutate(violator = as_factor(as.numeric(violator))) %>%
  mutate(violator = fct_recode(violator,
                           "completed parole" = "0",
                           "violated parole" = "1"))
str(parole)
```
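Each conversion above follows the same pattern: coerce the 0/1 column to a factor, then relabel its levels. For reference, the base-R equivalent (shown on a hypothetical vector, without forcats) does both steps in one `factor()` call:

```{r base R recode sketch}
# Hypothetical 0/1 indicator recoded in one step with base R's factor()
male_raw = c(0, 1, 1, 0)
male = factor(male_raw, levels = c(0, 1), labels = c("Female", "Male"))
levels(male)
```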
**Task 1** Split the data into training and testing sets
```{r Split the data}
set.seed(12345) #set the seed so the split can be reproduced
train.rows = createDataPartition(y = parole$violator, p=0.7, list = FALSE)
train = slice(parole, train.rows)
test = slice(parole, -train.rows)
```
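`createDataPartition` draws a stratified sample, so the 70/30 split preserves the violator class proportions. A minimal base-R sketch of the same idea, on a hypothetical 80/20 outcome vector:

```{r stratified split sketch}
# Hypothetical outcome: 80 completers, 20 violators
y = factor(c(rep("completed parole", 80), rep("violated parole", 20)))
set.seed(12345)
# Sample 70% within each class, which is what createDataPartition does
idx = unlist(lapply(levels(y), function(lv) {
  rows = which(y == lv)
  sample(rows, size = round(0.7 * length(rows)))
}))
table(y[idx])  # 56 completers, 14 violators: same 80/20 balance as the full data
```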
**Task 2** Create a classification tree using all of the predictor variables to predict “violator” in the training set. Plot the tree.
```{r classification tree 1}
tree1 = rpart(violator ~ ., data = train, method = "class")
fancyRpartPlot(tree1)
```
**Task 3** For the tree created in Task 2, how would you classify a 40 year-old parolee from Louisiana who served a 5 year prison sentence? Describe how you "walk through" the classification tree to arrive at your answer.
**The answer to this question depends on the race of the parolee.**
**A 40 year-old white parolee from Louisiana who served a 5 year prison sentence**
**The first decision box asks whether the parolee is from Virginia, Kentucky, or another state. The parolee is from Louisiana, so the answer is no. The next decision box asks whether the parolee is white or another race. The parolee is white, so the tree classifies the parolee as completing parole.**
**A 40 year-old non-white parolee from Louisiana who served a 5 year prison sentence**
**The first decision box asks whether the parolee is from Virginia, Kentucky, or another state. The parolee is from Louisiana, so the answer is no. The next decision box asks whether the parolee is white or another race. The parolee is non-white. The next decision box asks whether the time served is greater than 3.5 years; the answer is yes. The next decision box asks whether age is less than 30; the answer is no. The tree classifies the parolee as violating parole.**
**Task 4** Use the printcp function to evaluate tree performance as a function of the complexity parameter
(cp). What cp value should be selected?
**0.030303 is the cp value that should be selected.**
```{r printcp function}
printcp(tree1)
plotcp(tree1)
```
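The selected cp can also be pulled out of the cp table programmatically rather than read off the printout. A sketch on a mock matrix laid out like rpart's `cptable` (the numbers are illustrative, not the assignment's actual output):

```{r cp selection sketch}
# Mock cp table with rpart's column layout (illustrative values only)
cptable = matrix(c(0.0848, 0, 1.0000, 1.0000, 0.127,
                   0.0303, 1, 0.9152, 1.0303, 0.129,
                   0.0100, 4, 0.8242, 1.0606, 0.131),
                 ncol = 5, byrow = TRUE,
                 dimnames = list(NULL, c("CP", "nsplit", "rel error", "xerror", "xstd")))
# cp of the row with the lowest cross-validated error
best_cp = cptable[which.min(cptable[, "xerror"]), "CP"]
best_cp
```

With these mock numbers the lowest xerror sits at the root (nsplit = 0), which is exactly the situation where `prune()` at the minimum cross-validated error collapses the tree to a root.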
**Task 5** Prune the tree from Task 2 back to the cp value that you selected in Task 4. Do not attempt to plot the tree. You will find that the resulting tree is known as a “root”. A tree that takes the form of a root is essentially a naive model that assumes that the prediction for all observations is the majority class. Which class (category) in the training set is the majority class (i.e., has the most observations)?
**Completed parole is the majority class in the training set.**
Prune the tree (at minimum cross-validated error)
```{r Pruning}
tree2 = prune(tree1,cp= tree1$cptable[which.min(tree1$cptable[,"xerror"]),"CP"])
tree2
```
**Task 6** Use the unpruned tree from Task 2 to develop predictions for the training data. Use caret's confusionMatrix function to calculate the accuracy, specificity, and sensitivity of this tree on the training data. Note that we would not, in practice, use an unpruned tree, as such a tree is very likely to overfit on new data.
**Accuracy, specificity, and sensitivity of this tree on the training data:**
**Accuracy: 0.9027**
**Sensitivity: 0.9569**
**Specificity: 0.4909**
Predictions on training set
```{r Predictions on training set}
treepred = predict(tree1, train, type = "class")
head(treepred, n=50)
```
Caret confusion matrix and accuracy, etc. calcs
```{r confusion matrix}
confusionMatrix(treepred,train$violator,positive="completed parole")
```
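caret's figures can be reproduced by hand from the 2x2 table. A sketch with assumed counts (rows = predicted, columns = actual; the counts below are hypothetical values chosen to be consistent with the metrics reported above, not copied from the actual output):

```{r metrics by hand}
# Assumed counts, with "completed parole" as the positive class
TP = 400; FN = 18   # completers predicted correctly / missed
TN = 27;  FP = 28   # violators predicted correctly / missed
accuracy    = (TP + TN) / (TP + TN + FP + FN)
sensitivity = TP / (TP + FN)   # true positive rate
specificity = TN / (TN + FP)   # true negative rate
round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 4)
```

The low specificity reflects how few violators there are: with so few observations in that class, the tree struggles to identify them.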
**Task 7** Use the unpruned tree from Task 2 to develop predictions for the testing data. Use caret's confusionMatrix function to calculate the accuracy, specificity, and sensitivity of this tree on the testing data. Comment on the quality of the model.
**Accuracy, specificity, and sensitivity of this tree on the testing data:**
**Accuracy: 0.896**
**Sensitivity: 0.9553**
**Specificity: 0.4348**
**The quality of this model is good. The accuracy on the training data is 90.27%, and the accuracy on the testing data is 89.6%, so accuracy decreased only slightly on the testing data. The sensitivity and specificity are also roughly the same across the two sets, which suggests the model should generalize well to new data from this population.**
Predictions on testing set
```{r Predictions on testing set}
treepred_test = predict(tree1, newdata=test, type = "class")
head(treepred_test, n=24)
```
Caret confusion matrix and accuracy, etc. calcs
```{r}
confusionMatrix(treepred_test,test$violator,positive="completed parole")
```
**Task 8** Read in "Blood.csv" and convert "DonatedMarch" to a factor
```{r read in dataset and factor}
blood = read.csv("Blood.csv")
blood = blood %>% mutate(DonatedMarch = as_factor(as.numeric(DonatedMarch))) %>%
mutate(DonatedMarch = fct_recode(DonatedMarch,
"No" = "0",
"Yes" = "1"))
str(blood)
```
**Task 9** Split the dataset into training (70%) and testing (30%) sets.
Then develop a classification tree on the training set to predict “DonatedMarch”. Evaluate the complexity parameter (cp) selection for this model.
**0.010000 is the best complexity parameter (cp), reached after 4 splits in the data; the relative error at that point is 84.8%.**
```{r}
set.seed(1234)
train.rows = createDataPartition(y = blood$DonatedMarch, p=0.7, list = FALSE)
trainB = slice(blood, train.rows)
testB = slice(blood, -train.rows)
```
Create a classification tree
```{r classification tree}
treeB = rpart(DonatedMarch ~ ., data = trainB, method = "class")
fancyRpartPlot(treeB)
```
```{r}
printcp(treeB)
plotcp(treeB)
```
**Task 10** Prune the tree back to the optimal cp value, make predictions, and use the confusionMatrix function on both the training and testing sets. Comment on the quality of the predictions.
**Judging by the confusion matrices, this model is only okay (not really that good). The accuracy on the training data is 81.3%, and the accuracy on the testing data is 77.68%, so accuracy decreased slightly on the testing data. Both models are better than their naive (majority-class) baselines, and the sensitivity and specificity are roughly the same across the two sets.**
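"Better than the naive model" can be checked directly: the naive baseline always predicts the majority class, so its accuracy equals the majority-class proportion. A sketch with hypothetical training counts (the 399/125 split is assumed for illustration, not taken from the actual output):

```{r naive baseline sketch}
# Hypothetical training counts for DonatedMarch
n_no = 399; n_yes = 125
naive_accuracy = max(n_no, n_yes) / (n_no + n_yes)
naive_accuracy          # about 0.761, the accuracy to beat
0.813 > naive_accuracy  # the tree's training accuracy clears the baseline
```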
Prune the tree (at minimum cross-validated error)
```{r prune tree}
treeBp = prune(treeB,cp= treeB$cptable[which.min(treeB$cptable[,"xerror"]),"CP"])
treeBp
```
```{r cp value}
printcp(treeBp)
plotcp(treeBp)
```
Predictions on training set
```{r Predictions on training}
treepredB = predict(treeBp, trainB, type = "class") #predict with the pruned tree
head(treepredB, n=10)
```
Caret confusion matrix and accuracy, etc. calcs
```{r Caret confusion matrix on training}
confusionMatrix(treepredB,trainB$DonatedMarch,positive="Yes")
```
Predictions on testing set
```{r Predictions on testing}
treepred_testB = predict(treeBp, newdata = testB, type = "class") #predict with the pruned tree
head(treepred_testB)
```
Caret confusion matrix and accuracy, etc. calcs
```{r Caret confusion matrix on testing}
confusionMatrix(treepred_testB,testB$DonatedMarch,positive="Yes")
```