---
title: "01_intro"
output: html_document
---
class: middle, center, inverse
# 2 Introduction to `tidymodels`
---
background-image: url(https://www.tidymodels.org/images/tidymodels.png)
background-position: 97.5% 2.5%
background-size: 7.5%
layout: true
---
## 2 Introduction to `tidymodels`
> The tidymodels framework is a collection of packages for modeling and machine learning using tidyverse principles. ~ [tidymodels.org](https://www.tidymodels.org/)
.pull-left[.center[
```{r, echo=F, out.width='40%', out.height='40%'}
knitr::include_graphics("https://raw.githubusercontent.com/tidymodels/tidymodels/master/tidymodels_hex.png")
```
Official `tidymodels` [Hex Sticker](https://github.com/rstudio/hex-stickers)
]]
.pull-right[
.pull-left[
```{r, echo=F, out.width='80%', out.height='80%'}
knitr::include_graphics("https://avatars.githubusercontent.com/u/12505835?v=4")
```
**Julia Silge** - Software Engineer @ RStudio
]
.pull-right[
```{r, echo=F, out.width='80%', out.height='80%'}
knitr::include_graphics("https://avatars.githubusercontent.com/u/5731043?v=4")
```
**Max Kuhn** - Software Engineer @ RStudio
]]
--
> Whenever possible, the software should be able to protect users from committing mistakes. Software should make it easy for users to do the right thing. ~ [Kuhn/Silge (2021)](https://www.tmwr.org/software-modeling.html#software-modeling)
???
- a framework for modeling (with guardrails) using tidy data principles
- very similar to the unified `scikit-learn` package in `Python`
- by the way, this is in general a central distinction between R and Python: Python advocates the paradigm of having one unified approach for every problem (which at times also makes it less flexible)
---
## 2 Introduction to `tidymodels`
> The tidymodels framework is a **collection of packages** for modeling and machine learning using tidyverse principles. ~ [tidymodels.org](https://www.tidymodels.org/)
.pull-left[
**`tidymodels` core packages:**
- `rsample`: general methods for resampling
- `recipes`: unified interface to data preprocessing
- `parsnip`: unified interface to modeling
- `workflows`: combine model blueprints and preprocessing recipes
- `dials`: create tuning parameters
- `tune`: hyperparameter tuning
- `broom`: tidy model outputs
- `yardstick`: model evaluation
]
.pull-right[
```{r, echo=F, out.width='85%', out.height='85%', fig.align='center'}
knitr::include_graphics("./img/tidymodels-hex.PNG")
```
]
???
- tidymodels can be viewed as another meta-package that shares the design philosophy, grammar and data structures of the tidyverse
- each package has its own goal, which makes tidymodels a modular collection of packages
- A goal of the tidymodels packages is that the interfaces to common tasks are standardized
- we will discuss each package along the modeling workflow: resampling, preprocessing, model building, hyperparameter tuning, model evaluation
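A minimal sketch of how these packages fit together along the modeling workflow. It assumes the built-in `mtcars` data and a plain linear regression as stand-ins; neither is part of this workshop's material, and the object names are illustrative only.

```r
library(tidymodels)

set.seed(123)

# rsample: split the data into a training and a test set
car_split <- initial_split(mtcars, prop = 0.8)
car_train <- training(car_split)
car_test  <- testing(car_split)

# recipes: declare the preprocessing steps
car_rec <- recipe(mpg ~ wt + hp, data = car_train) %>%
  step_normalize(all_numeric_predictors())

# parsnip: specify the model independently of the fitting engine
lm_spec <- linear_reg() %>%
  set_engine("lm")

# workflows: bundle recipe and model, then fit on the training set
car_fit <- workflow() %>%
  add_recipe(car_rec) %>%
  add_model(lm_spec) %>%
  fit(data = car_train)

# yardstick: evaluate the predictions on the held-out test set
car_preds <- predict(car_fit, new_data = car_test) %>%
  bind_cols(car_test)
car_rmse <- rmse(car_preds, truth = mpg, estimate = .pred)
car_rmse
```

Hyperparameter tuning (`dials`, `tune`) is left out of this sketch because a plain linear regression has nothing to tune.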
---
## 2 Introduction to `tidymodels`
> The tidymodels framework is a **collection of packages** for modeling and machine learning using tidyverse principles. ~ [tidymodels.org](https://www.tidymodels.org/)
```{r, eval=F}
install.packages("tidymodels")
library(tidymodels)
```
```
-- Attaching packages ----------------------------- tidymodels 0.1.4 --
v broom        0.7.9      v recipes      0.1.17
v dials        0.0.10     v rsample      0.1.0
v dplyr        1.0.7      v tibble       3.1.4
v ggplot2      3.3.5      v tidyr        1.1.4
v infer        1.0.0      v tune         0.1.6
v modeldata    0.1.1      v workflows    0.2.3
v parsnip      0.1.7      v workflowsets 0.1.0
v purrr        0.3.4      v yardstick    0.0.8
-- Conflicts ------------------------------- tidymodels_conflicts() --
x purrr::discard() masks scales::discard()
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
x recipes::step() masks stats::step()
* Use suppressPackageStartupMessages() to eliminate package startup messages
```
???
Explain:
- very similar to what you see when you load the whole tidyverse
- as you can see, tidymodels also attaches some of the tidyverse packages (although you would usually load both at the beginning of your R session) -> some tidymodels functions build on dplyr, purrr and ggplot2 functionality
- again we have some conflicts here: these functions mask functions of the same name, mostly from the base `R` `stats` package
- `tidymodels v0.1.4`: a relatively new package ecosystem, so some features or function interfaces may still change slightly in the future
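A minimal sketch of what such a conflict means in practice, using the built-in `mtcars` data as a stand-in. Once `dplyr` (or `tidymodels`) is attached, a bare `filter()` refers to the `dplyr` version, while the `stats` version stays reachable through its namespace prefix.

```r
library(dplyr)

# dplyr::filter() subsets the rows of a data frame
four_cyl <- dplyr::filter(mtcars, cyl == 4)

# stats::filter() applies a linear filter to a series,
# here a centred moving average over three values
moving_avg <- stats::filter(c(1, 2, 3, 4, 5), rep(1 / 3, 3))

nrow(four_cyl)
as.numeric(moving_avg)
```

Writing the prefix explicitly, as above, is the most robust way to avoid surprises. Recent versions of `tidymodels` also offer `tidymodels_prefer()`, which is meant to settle such conflicts in favour of the tidymodels functions.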
---
## 2 Introduction to `tidymodels`
Remember, modeling is one of the main steps in our day-to-day data science workflow. And this is precisely where `tidymodels` fits in!
<br><br><br>
```{r, echo=F, out.width='75%', out.height='75%', fig.align='center'}
knitr::include_graphics("https://www.tmwr.org/premade/data-science-model.svg")
```
.center[
*Source: [Kuhn/Silge (2021), ch. 1.3](https://www.tmwr.org/software-modeling.html#model-phases)*
]
---
layout: false
class: middle, center, inverse
# 3 Himalayan Climbing<br>Expeditions Data
---
## 3 Himalayan Climbing Expeditions Data
In order to illustrate the features of the `tidymodels` ecosystem, we use the [Himalayan Climbing Expeditions](https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-09-22/readme.md) data set from the [`tidytuesday` project](https://github.com/rfordatascience/tidytuesday).
```{r, cache=T, results='hide', message=F}
# install.packages("tidytuesdayR")
tt_data <- tidytuesdayR::tt_load(2020, week = 39)
```
```
> --- Compiling #TidyTuesday Information for 2020-09-22 ----
> --- There are 3 files available ---
> --- Starting Download ---
>
> Downloading file 1 of 3: `peaks.csv`
> Downloading file 2 of 3: `members.csv`
> Downloading file 3 of 3: `expeditions.csv`
>
> --- Download complete ---
```
???
- TidyTuesday: a weekly social data project that encourages the R online community to practise tools like ggplot2, dplyr and tidyr on real-world data
- around 50 different data sets right now
- this dataset consists of three different csv files
---
## 3 Himalayan Climbing Expeditions Data
The data set contains extensive records for the period 1905-2019 on
- `r emo::ji("mountain_snow")` the **peaks** of the mountain range,
- `r emo::ji("paw_prints")` the **expeditions** conducted during this period, and
- `r emo::ji("woman_climbing")` the **members** of each expedition.
--
<br>
**Task:** Predict the likelihood of an expedition coming to a lethal end (i.e. *binary classification task*).
```{r, eval=F}
tt_data$members %>%
  skimr::skim()
```
```
> Output on next slide
```
???
- Motivations for the task: derive drivers for a successful expedition and eventually reduce death rates.
- use `skimr` package to get a high-level view of the data and most important descriptives
---
## 3 Himalayan Climbing Expeditions Data
.panelset[
.panel[
.panel-name[Data Summary]
```
> -- Data Summary ------------------------
>                            Values
> Name                       Piped data
> Number of rows             76519
> Number of columns          21
> _______________________
> Column type frequency:
>   character                10
>   logical                  6
>   numeric                  5
> ________________________
> Group variables            None
```
]
.panel[
.panel-name[Character Vars]
```
> -- Variable type: character ---------------------------------------------------------------------------
> # A tibble: 10 x 8
>    skim_variable   n_missing complete_rate   min   max empty n_unique whitespace
>  * <chr>               <int>         <dbl> <int> <int> <int>    <int>      <int>
>  1 expedition_id           0             1     9     9     0    10350          0
>  2 member_id               0             1    12    12     0    76518          0
>  3 peak_id                 0             1     4     4     0      391          0
>  4 peak_name              15          1.00     4    25     0      390          0
>  5 season                  0             1     6     7     0        5          0
>  6 sex                     2          1.00     1     1     0        2          0
>  7 citizenship            10          1.00     2    23     0      212          0
>  8 expedition_role        21          1.00     4    25     0      524          0
>  9 death_cause         75413        0.0145     3    27     0       12          0
> 10 injury_type         74807        0.0224     3    27     0       11          0
```
]
.panel[
.panel-name[Logical Vars]
```
> -- Variable type: logical -----------------------------------------------------------------------------
> # A tibble: 6 x 5
>   skim_variable n_missing complete_rate    mean count
> * <chr>             <int>         <dbl>   <dbl> <chr>
> 1 hired                 0             1   0.206 FAL: 60788, TRU: 15731
> 2 success               0             1   0.382 FAL: 47320, TRU: 29199
> 3 solo                  0             1 0.00158 FAL: 76398, TRU: 121
> 4 oxygen_used           0             1   0.238 FAL: 58286, TRU: 18233
> 5 died                  0             1  0.0145 FAL: 75413, TRU: 1106
> 6 injured               0             1  0.0224 FAL: 74806, TRU: 1713
```
]
.panel[
.panel-name[Numeric Vars]
```
> -- Variable type: numeric -----------------------------------------------------------------------------
> # A tibble: 5 x 11
>   skim_variable        n_missing complete_rate  mean    sd    p0   p25   p50   p75  p100 hist
> * <chr>                    <int>         <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
> 1 year                         0             1 2000.  14.8  1905  1991  2004  2012  2019 ▁▁▁▃▇
> 2 age                       3497         0.954  37.3  10.4     7    29    36    44    85 ▁▇▅▁▁
> 3 highpoint_metres         21833         0.715 7471. 1040.  3800  6700  7400  8400  8850 ▁▁▆▃▇
> 4 death_height_metres      75451        0.0140 6593. 1308.   400  5800  6600  7550  8830 ▁▁▂▇▆
> 5 injury_height_metres     75510        0.0132 7050. 1214.   400  6200  7100  8000  8880 ▁▁▂▇▇
```
]
]
???
**Pt. 1:**
- total of 76,519 expedition members
- categorization of data types
**Pt. 2:**
- three id columns, which are likely not supposed to end up in any predictive model -> in any case, if an id variable has predictive value, you should question the data-generating process behind the id column
- 391 different peaks, but only 390 different peak names
- of the 76,519 climbers, about 1,100 died (`death_cause` is missing for 75,413 members) and about 1,700 were injured (`injury_type` is missing for 74,807 members) -> imbalanced prediction task
- 524 different expedition roles
- why do we have five seasons? (probably an unknown category)
**Pt. 3:**
logical:
- never missing
- `hired` natives (around 20% of the expedition members)
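- only 38% of expedition members made it to the top (`success`)
- we can likely have expeditions that were successful, but where one or several members died
- the `TRUE` counts of `died` and `injured` correspond to the non-missing counts of `death_cause` and `injury_type` (exactly for `died`; `injured` has one more `TRUE` than there are non-missing `injury_type` values)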
**Pt. 4:**
numeric:
- histogram of `year`: expeditions took place more and more often in the two most recent decades
- `age`: I would expect most climbers to be between 20 and 40, with a few very old climbers (85) and some very young ones (7?!)
- `age` and `highpoint_metres` have a lot of missing values!
usually, you would do a lot more EDA right now:
- plot of expedition year against success/failure rates -> more recent expeditions likely more successful as you know more about the region/have better equipment
- plot of age against success/failure rates -> younger, more athletic climbers more successful?
- check which peaks or seasons are most associated with climber deaths
- check if oxygen use is associated with death rates
- good practice is always to do a correlation matrix
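One of these checks, sketched on a small made-up data frame. The eight rows below are purely illustrative and are not taken from the expeditions data; with the real data, the hypothetical `toy_members` would be replaced by `tt_data$members`.

```r
library(dplyr)

# hypothetical toy data that mimics three columns of the members table
toy_members <- tibble(
  season      = c("Spring", "Spring", "Autumn", "Autumn",
                  "Spring", "Winter", "Autumn", "Winter"),
  oxygen_used = c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE),
  died        = c(FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE)
)

# share of members who died, split by oxygen use
death_by_oxygen <- toy_members %>%
  group_by(oxygen_used) %>%
  summarise(n = n(), death_rate = mean(died), .groups = "drop")

death_by_oxygen
```

Swapping `oxygen_used` for `season` or `peak_name` in `group_by()` gives the corresponding breakdown for the other checks listed above.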
---
## 3 Himalayan Climbing Expeditions Data
```{r}
climbers_df <- tt_data$members %>%
  select(member_id, peak_name, season, year, sex, age, citizenship,
         expedition_role, hired, solo, oxygen_used, success, died) %>%
  # drop rows with a missing value in any of these four predictors
  filter(!is.na(sex), !is.na(citizenship), !is.na(peak_name), !is.na(expedition_role)) %>%
  # convert all character and logical columns to factors
  mutate(across(where(~ is.character(.) | is.logical(.)), as.factor))
climbers_df
```
???
Note: After the removal of missing values in the `sex`, `citizenship`, `peak_name` and `expedition_role` predictors, the data set shrinks from 76,519 to 76,471 observations