---
title: 'Going deeper with dplyr: New features in 0.3 and 0.4'
output: html_document
---
## Introduction
In August 2014, I created a [40-minute video tutorial](https://www.youtube.com/watch?v=jWjqLW-u3hc) introducing the key functionality of the dplyr package in R, using dplyr version 0.2. Since then, there have been two significant updates to dplyr (0.3 and 0.4), introducing a ton of new features.
This document (created in March 2015) covers the most useful new features in 0.3 and 0.4, as well as other functionality that I didn't cover last time (though it is not necessarily new). My [new video tutorial](https://www.youtube.com/watch?v=2mh1PqfsXVI) walks through the code below in detail.
**If you have not watched the [previous tutorial](https://www.youtube.com/watch?v=jWjqLW-u3hc)**, I recommend you do so first since it covers some dplyr basics that will not be covered in this tutorial.
## Loading dplyr and the nycflights13 dataset
Although my last tutorial used data from the hflights package, Hadley Wickham has rewritten the [dplyr vignettes](http://cran.r-project.org/web/packages/dplyr/index.html) to use the nycflights13 package instead, and so I'm also using nycflights13 for the sake of consistency.
```{r eval=FALSE}
# remove flights data if you just finished my previous tutorial
rm(flights)
```
```{r}
# load packages
suppressMessages(library(dplyr))
library(nycflights13)
# print the flights dataset from nycflights13
flights
```
## Choosing columns: select, rename
```{r}
# besides just using select() to pick columns...
flights %>% select(carrier, flight)
# ...you can use the minus sign to hide columns
flights %>% select(-month, -day)
```
```{r results='hide'}
# hide a range of columns
flights %>% select(-(dep_time:arr_delay))
# hide any column with a matching name
flights %>% select(-contains("time"))
```
```{r}
# pick columns using a character vector of column names
cols <- c("carrier", "flight", "tailnum")
flights %>% select(one_of(cols))
```
```{r}
# select() can be used to rename columns, though all columns not mentioned are dropped
flights %>% select(tail = tailnum)
# rename() does the same thing, except all columns not mentioned are kept
flights %>% rename(tail = tailnum)
```
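The difference is easier to see on a tiny table. This is a minimal sketch that uses a made-up data frame (`toy` and its values are hypothetical, not taken from nycflights13) and compares the column names each verb returns:

```{r}
suppressMessages(library(dplyr))
# hypothetical three-column data frame, used only for illustration
toy <- data.frame(tailnum = c("N1", "N2"), carrier = c("AA", "UA"), flight = c(1, 2),
                  stringsAsFactors = FALSE)
# select() renames tailnum and drops the two columns that are not mentioned
toy_selected <- toy %>% select(tail = tailnum)
names(toy_selected)
# rename() renames tailnum and keeps the other columns in their original order
toy_renamed <- toy %>% rename(tail = tailnum)
names(toy_renamed)
```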
## Choosing rows: filter, between, slice, sample_n, top_n, distinct
```{r}
# filter() supports the use of multiple conditions
flights %>% filter(dep_time >= 600, dep_time <= 605)
```
```{r results='hide'}
# between() is a concise alternative for determining if numeric values fall in a range
flights %>% filter(between(dep_time, 600, 605))
# side note: is.na() can also be useful when filtering
flights %>% filter(!is.na(dep_time))
```
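If you are unsure whether `between()` includes its endpoints, you can check it on a plain vector. The sketch below uses made-up departure times (`dep` is a hypothetical vector, not a column from `flights`) that straddle the 600 to 605 range:

```{r}
suppressMessages(library(dplyr))
# hypothetical departure times, including both endpoints and a missing value
dep <- c(559, 600, 603, 605, 606, NA)
# between() is inclusive on both sides, so 600 and 605 are both in range
in_range <- between(dep, 600, 605)
in_range
# the result matches the two-condition comparison, including the NA
identical(in_range, dep >= 600 & dep <= 605)
```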
```{r}
# slice() filters rows by position
flights %>% slice(1000:1005)
# keep the first three rows within each group
flights %>% group_by(month, day) %>% slice(1:3)
# sample three rows from each group
flights %>% group_by(month, day) %>% sample_n(3)
# keep three rows from each group with the top dep_delay
flights %>% group_by(month, day) %>% top_n(3, dep_delay)
# also sort by dep_delay within each group
flights %>% group_by(month, day) %>% top_n(3, dep_delay) %>% arrange(desc(dep_delay))
```
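One thing to be aware of with `top_n()` is that it keeps ties, so it can return more rows than you asked for. Here is a small sketch with made-up delays (`delays` is a hypothetical data frame) in which two rows tie for second place:

```{r}
suppressMessages(library(dplyr))
# hypothetical delays: ids 2 and 3 share the second-largest value
delays <- data.frame(id = 1:5, dep_delay = c(10, 30, 30, 50, 20))
# asking for the top 2 keeps the 50 and both 30s, in their original row order
top_two <- delays %>% top_n(2, dep_delay)
top_two
nrow(top_two)
```

If you need exactly three rows per group regardless of ties, arranging and then using `slice(1:3)` is one alternative.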
```{r}
# unique rows can be identified using unique() from base R
flights %>% select(origin, dest) %>% unique()
```
```{r results='hide'}
# dplyr provides an alternative that is more "efficient"
flights %>% select(origin, dest) %>% distinct()
# side note: when chaining, you don't have to include the parentheses if there are no arguments
flights %>% select(origin, dest) %>% distinct
```
## Adding new variables: mutate, transmute, add_rownames
```{r}
# mutate() creates a new variable (and keeps all existing variables)
flights %>% mutate(speed = distance/air_time*60)
# transmute() only keeps the new variables
flights %>% transmute(speed = distance/air_time*60)
```
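The speed formula assumes that `distance` is in miles and `air_time` is in minutes, so multiplying by 60 converts miles per minute into miles per hour. A quick sketch with made-up numbers (`trips` is a hypothetical data frame) makes the arithmetic easy to verify by hand:

```{r}
suppressMessages(library(dplyr))
# hypothetical trips: distance in miles, air_time in minutes
trips <- data.frame(distance = c(300, 1200), air_time = c(60, 120))
# 300 miles in 60 minutes is 300 mph; 1200 miles in 120 minutes is 600 mph
trips_speed <- trips %>% mutate(speed = distance / air_time * 60)
trips_speed$speed
# transmute() returns only the newly created column
speed_only <- trips %>% transmute(speed = distance / air_time * 60)
names(speed_only)
```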
```{r}
# example data frame with row names
mtcars %>% head()
# add_rownames() turns row names into an explicit variable
mtcars %>% add_rownames("model") %>% head()
# side note: dplyr no longer prints row names (ever) for local data frames
mtcars %>% tbl_df()
```
## Grouping and counting: summarise, tally, count, group_size, n_groups, ungroup
```{r}
# summarise() can be used to count the number of rows in each group
flights %>% group_by(month) %>% summarise(cnt = n())
```
```{r results='hide'}
# tally() and count() can do this more concisely
flights %>% group_by(month) %>% tally()
flights %>% count(month)
```
```{r}
# you can sort by the count
flights %>% group_by(month) %>% summarise(cnt = n()) %>% arrange(desc(cnt))
```
```{r results='hide'}
# tally() and count() have a sort parameter for this purpose
flights %>% group_by(month) %>% tally(sort=TRUE)
flights %>% count(month, sort=TRUE)
```
```{r}
# you can sum over a specific variable instead of simply counting rows
flights %>% group_by(month) %>% summarise(dist = sum(distance))
```
```{r results='hide'}
# tally() and count() have a wt parameter for this purpose
flights %>% group_by(month) %>% tally(wt = distance)
flights %>% count(month, wt = distance)
```
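To confirm that the `wt` shortcut and the longer `summarise()` version agree, you can run both on a table small enough to add up in your head. This sketch uses a made-up data frame (`legs` is hypothetical); note that `count()` stores its result in a column named `n`:

```{r}
suppressMessages(library(dplyr))
# hypothetical flights: two in month 1, one in month 2
legs <- data.frame(month = c(1, 1, 2), distance = c(100, 250, 400))
# weighted count: sums distance within each month
by_count <- legs %>% count(month, wt = distance)
# the equivalent group_by() plus summarise() spelling
by_summarise <- legs %>% group_by(month) %>% summarise(n = sum(distance))
by_count$n
by_summarise$n
```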
```{r}
# group_size() returns the counts as a vector
flights %>% group_by(month) %>% group_size()
# n_groups() simply reports the number of groups
flights %>% group_by(month) %>% n_groups()
```
```{r}
# group by two variables, summarise, arrange (output is possibly confusing)
flights %>% group_by(month, day) %>% summarise(cnt = n()) %>% arrange(desc(cnt)) %>% print(n = 40)
# ungroup() before arranging to arrange across all groups
flights %>% group_by(month, day) %>% summarise(cnt = n()) %>% ungroup() %>% arrange(desc(cnt))
```
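The reason `ungroup()` matters is that `summarise()` removes only the last grouping variable, so the result above is still grouped by month. The sketch below inspects the grouping directly on a made-up table (`daily` is hypothetical). It checks the grouping rather than the sort order of the grouped result, since how `arrange()` treats grouped data depends on your dplyr version:

```{r}
suppressMessages(library(dplyr))
# hypothetical data: two months with two days each
daily <- data.frame(month = c(1, 1, 2, 2), day = c(1, 2, 1, 2), flights = c(5, 9, 7, 3))
# after summarise(), the day grouping is gone but the month grouping remains
per_day <- daily %>% group_by(month, day) %>% summarise(cnt = sum(flights))
groups(per_day)
# after ungroup(), nothing is grouped, so arrange() sorts the whole table
per_day_sorted <- per_day %>% ungroup() %>% arrange(desc(cnt))
groups(per_day_sorted)
per_day_sorted$cnt
```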
## Creating data frames: data_frame
`data_frame()` is a more predictable alternative to `data.frame()` for creating data frames. Benefits of `data_frame()`:
* You can use previously defined columns to compute new columns.
* It never coerces column types.
* It never munges column names.
* It never adds row names.
* It only recycles length 1 input.
* It returns a local data frame (a tbl_df).
```{r}
# data_frame() example
data_frame(a = 1:6, b = a*2, c = 'string', 'd+e' = 1) %>% glimpse()
# data.frame() example
data.frame(a = 1:6, c = 'string', 'd+e' = 1) %>% glimpse()
```
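Two of the `data.frame()` behaviors that the list above contrasts with, name munging and recycling, can be checked using base R alone. This is a small sketch (the object names `munged`, `kept`, and `recycled` are made up for illustration):

```{r}
# data.frame() rewrites non-syntactic column names by default
munged <- data.frame('d+e' = 1)
names(munged)
# check.names = FALSE is the base R way to keep the name as written
kept <- data.frame('d+e' = 1, check.names = FALSE)
names(kept)
# data.frame() silently recycles a shorter input when the lengths divide evenly
recycled <- data.frame(a = 1:6, b = 1:2)
recycled$b
```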
## Joining (merging) tables: left_join, right_join, inner_join, full_join, semi_join, anti_join
```{r}
# create two simple data frames
(a <- data_frame(color = c("green","yellow","red"), num = 1:3))
(b <- data_frame(color = c("green","yellow","pink"), size = c("S","M","L")))
# only include observations found in both "a" and "b" (automatically joins on variables that appear in both tables)
inner_join(a, b)
# include observations found in either "a" or "b"
full_join(a, b)
# include all observations found in "a"
left_join(a, b)
# include all observations found in "b"
right_join(a, b)
# right_join(a, b) is identical to left_join(b, a) except for column ordering
left_join(b, a)
# filter "a" to only show observations that match "b"
semi_join(a, b)
# filter "a" to only show observations that don't match "b"
anti_join(a, b)
```
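A compact way to compare the joins is to count the rows that each one returns. The sketch below rebuilds the same two tables with base R under different names (`left_tbl` and `right_tbl` are hypothetical names, chosen so that "a" and "b" are left untouched) and spells out the `by` argument explicitly:

```{r}
suppressMessages(library(dplyr))
# same contents as "a" and "b" above: two shared colors, one unique to each side
left_tbl <- data.frame(color = c("green", "yellow", "red"), num = 1:3,
                       stringsAsFactors = FALSE)
right_tbl <- data.frame(color = c("green", "yellow", "pink"), size = c("S", "M", "L"),
                        stringsAsFactors = FALSE)
# number of rows returned by each type of join
join_rows <- c(
  inner = nrow(inner_join(left_tbl, right_tbl, by = "color")),
  full  = nrow(full_join(left_tbl, right_tbl, by = "color")),
  left  = nrow(left_join(left_tbl, right_tbl, by = "color")),
  semi  = nrow(semi_join(left_tbl, right_tbl, by = "color")),
  anti  = nrow(anti_join(left_tbl, right_tbl, by = "color"))
)
join_rows
```

The inner and semi joins return the same number of rows here, but the semi join keeps only the columns of the first table.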
```{r}
# sometimes matching variables don't have identical names
b <- b %>% rename(col = color)
# specify that the join should occur by matching "color" in "a" with "col" in "b"
inner_join(a, b, by=c("color" = "col"))
```
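When the key columns have different names, the joined result keeps the name from the first table. Here is a minimal sketch with made-up tables (`lhs` and `rhs` are hypothetical names) that shows the resulting column names:

```{r}
suppressMessages(library(dplyr))
# hypothetical tables whose key columns are named differently
lhs <- data.frame(color = c("green", "red"), num = 1:2, stringsAsFactors = FALSE)
rhs <- data.frame(col = c("green", "pink"), size = c("S", "L"), stringsAsFactors = FALSE)
# match "color" in lhs with "col" in rhs; only "green" appears in both
joined <- inner_join(lhs, rhs, by = c("color" = "col"))
names(joined)
nrow(joined)
```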
## Viewing more output: print, View
```{r}
# specify that you want to see more rows
flights %>% print(n = 15)
```
```{r eval=FALSE}
# specify that you want to see ALL rows (don't run this!)
flights %>% print(n = Inf)
```
```{r}
# specify that you want to see all columns
flights %>% print(width = Inf)
```
```{r eval=FALSE}
# show up to 1000 rows and all columns
flights %>% View()
# set option to see all columns and fewer rows
options(dplyr.width = Inf, dplyr.print_min = 6)
# reset options (or just close R)
options(dplyr.width = NULL, dplyr.print_min = 10)
```
## Resources
* Release announcements for [version 0.3](http://blog.rstudio.org/2014/10/13/dplyr-0-3-2/) and [version 0.4](http://blog.rstudio.org/2015/01/09/dplyr-0-4-0/)
* [dplyr reference manual and vignettes](http://cran.r-project.org/web/packages/dplyr/)
* [Two-table vignette](http://cran.r-project.org/web/packages/dplyr/vignettes/two-table.html) covering joins and set operations
* [RStudio's Data Wrangling Cheat Sheet](http://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf) for dplyr and tidyr
* [dplyr GitHub repo](https://github.com/hadley/dplyr) and [list of releases](https://github.com/hadley/dplyr/releases)
## Data School
* [Blog](http://www.dataschool.io/)
* [Email newsletter](http://www.dataschool.io/subscribe/)
* [YouTube channel](http://youtube.com/dataschool)