---
title: "Lab 1"
subtitle: "Resampling"
date: "Assigned 10/14/20, Due 10/21/20"
output:
  html_document:
    toc: true
    toc_float: true
    theme: "journal"
    css: "website-custom.css"
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE,
                      warning = FALSE,
                      message = FALSE)
library(tidyverse)
library(tidymodels)
```
### Read in the `train.csv` data.
```{r, data}
```
### 1. Initial Split
Split the data into a training set and a testing set, stored as two named objects. Show the `class` of the initial split object and of the training and test sets.
```{r, initial_split}
set.seed(3000)
```
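As a rough illustration of the kind of calls involved (not the lab solution), the sketch below splits the built-in `mtcars` data with `{rsample}`. The dataset, the object names, and `prop = 0.75` are assumptions made for the example and are not part of the lab.

```r
library(rsample)

set.seed(3000)
# mtcars is a stand-in dataset for illustration, not the lab's train.csv
example_split <- initial_split(mtcars, prop = 0.75)

# Pull the two pieces out of the split as separate named objects
example_train <- training(example_split)
example_test  <- testing(example_split)

# class() reports the object type of the split and of each piece
class(example_split)
class(example_train)
class(example_test)
```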
### 2. Use code to show the proportion of the `train.csv` data that went to each of the training and test sets.
```{r}
```
### 3. *k*-fold cross-validation
Use 10-fold cross-validation to resample the training data.
```{r, resample}
set.seed(3000)
```
### 4. Use `{purrr}` to add the following columns to your *k*-fold CV object:
* *analysis_n* = the *n* of the analysis set for each fold
* *assessment_n* = the *n* of the assessment set for each fold
* *analysis_p* = the proportion of the training data that falls in the analysis set for each fold
* *assessment_p* = the proportion of the training data that falls in the assessment set for each fold
* *sped_p* = the proportion of students receiving special education services (`sp_ed_fg`) in the analysis and assessment sets for each fold
```{r, purrr}
```
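The pattern of mapping over the `splits` column can be sketched as follows, again on the built-in `mtcars` data rather than the lab data. The object name `folds_summary` and the column `am_p` are hypothetical. `am` is a numeric 0/1 column, so its mean is a proportion, while a flag stored as text would first need to be compared to its "yes" value.

```r
library(rsample)
library(dplyr)
library(purrr)

set.seed(3000)
# mtcars stands in for the lab's training data (assumption for illustration)
folds <- vfold_cv(mtcars, v = 10)

folds_summary <- folds %>%
  mutate(
    # row counts of each fold's analysis and assessment sets
    analysis_n   = map_int(splits, ~ nrow(analysis(.x))),
    assessment_n = map_int(splits, ~ nrow(assessment(.x))),
    # share of all resampled rows that fall in the analysis set
    analysis_p   = analysis_n / (analysis_n + assessment_n),
    # mean of a 0/1 column is a proportion; `am` stands in for a flag column
    am_p         = map_dbl(splits, ~ mean(analysis(.x)$am))
  )

folds_summary
```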
### 5. Please demonstrate that there are **no** common values in the `id` columns of the `assessment` data between `Fold01` & `Fold02`, and between `Fold09` & `Fold10` (of your 10-fold cross-validation object).
```{r}
```
### 6. Try to answer these next questions without running similar code on real data.
For the following code `vfold_cv(fictional_train, v = 20)`:
* What is the proportion in the analysis set for each fold?
* What is the proportion in the assessment set for each fold?
### 7. Use Monte Carlo CV to resample the training data with 20 resamples and 0.30 of each resample reserved for the assessment set.
```{r}
set.seed(3000)
```
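A minimal sketch of Monte Carlo CV with `{rsample}`, assuming the built-in `mtcars` data in place of the lab data. In `mc_cv()`, `prop` is the share kept for the analysis set, so holding out 0.30 for assessment corresponds to `prop = 0.70`. The row names of `mtcars` play the role of an `id` column here, and the object names are hypothetical.

```r
library(rsample)

set.seed(3000)
# prop is the analysis share, so 0.70 leaves 0.30 of the rows for assessment
mc_example <- mc_cv(mtcars, prop = 0.70, times = 20)

# Row names of the assessment sets from two different resamples
held_out_a <- rownames(assessment(mc_example$splits[[1]]))
held_out_b <- rownames(assessment(mc_example$splits[[2]]))

# intersect() lists rows held out in both resamples; unlike k-fold CV,
# Monte Carlo assessment sets are drawn independently and may overlap
intersect(held_out_a, held_out_b)
```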
### 8. Please demonstrate that there **are** common values in the `id` columns of the `assessment` data between `Resample 8` & `Resample 12`, and between `Resample 2` & `Resample 20` in your MC CV object.
```{r}
```
### 9. You plan on doing bootstrap resampling with a training set with *n* = 500.
* What is the sample size of an analysis set for a given bootstrap resample?
* What is the sample size of an assessment set for a given bootstrap resample?
* If each row was selected only once for an analysis set:
    + what would be the size of the analysis set?
    + and what would be the size of the assessment set?