-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathcode.Rmd
235 lines (157 loc) · 8.41 KB
/
code.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
---
title: "Data Manipulation and Visualization"
author: "Instructors: Isaac Towers with help from Daniel Falster, Will Cornwell, Fonti Kar and Fiona Robinson"
output: html_document
---
# Introduction to movies dataset
For today's exercises we're using a data on movie production, from the ["Tidy Tuesday" collection](https://github.com/rfordatascience/tidytuesday).
```{r}
library(readr)
library(ggplot2)
list.files("data/movie_profit")
movies <- read_csv("data/movie_profit/movie_profit.csv")
```
What variables?
```{r}
names(movies)
```
An overview of the data
```{r}
View(movies)
```
or summarise using the `skimr` package:
```{r}
skimr::skim(movies)
str(movies)
```
**Exercise**: With your partner, review the `README` for the data at `data/movie_profit/readme.md` to see the variables included.
# What is the tidyverse?
* The [Tidyverse](http://tidyverse.org) is the name given to suite of R packages designed for seamless data analysis
* Designed to help you fall into a **"Pit of Success"**
* Tools are designed to work seamlessly together, for: 1) Turning data into tidy data, and 2) Plotting & analysing Tidy Data
* Not one but a collection packages
* Dataframes (tibbles) are the universal "tidy" input and output
Load (and install) individually or all together
```{r, eval = FALSE}
library(tidyverse)
library(ggplot2)
```
# Data manipulation with `dplyr`
Motivation:
- Data is never organized in the way you want it
- High % of project is data wrangling
- Many many many modern jobs are data wrangling
**Exercise**: Together with your partner, come up with 3 types of change you may need to make on a dataset before it is ready for analysis.
`dplyr` used verbs to describe the actions we want to take on the data
- `select` -> subset columns
- `filter` -> subset rows
- `arrange` –> order rows
- `rename` –> rename variables
- `mutate` –> make new variables
- `summarise`–> summarise data
- `distinct` -> filter to each unique row
Examples:
To select certain variables:
```{r}
select(movies, genre)
```
To filter to particular rows:
```{r}
filter(movies, distributor == "Universal")
```
To sort by certain variables:
```{r}
arrange(movies, distributor, production_budget)
```
Create a new variable:
```{r}
mutate(movies, log_budget = log10(production_budget))
```
## Pipes
The pipe is a bit of magic. It's written by `%>%` (Shift-Command-M on Mac or Shift-Control-M on PC ). We can use "the pipe" [%>%](http://magrittr.tidyverse.org/reference/pipe.html) to connect expressions
* `%>%` is an **infix operator** -> expects commands on left & right
* Comes from the [magrittr](http://magrittr.tidyverse.org/reference/pipe.html) package
* `%>%` "pipes" the **output** of the last expression as the **first input** of the next expression
* If you use RStudio, you can type the pipe with Ctrl + Shift + M if you have a PC or Cmd + Shift + M if you have a Mac.
Examples:
```{r}
movies$distributor %>% unique()
```
```{r}
movies$distributor %>% unique() %>% length()
movies$distributor %>% n_distinct()
```
But you can control the input position of the next function with `.`:
```{r}
20 %>% seq(1, 4, length.out = .)
```
Tidyverse functions are written to work with pipes, i.e. most take the data as the first argument.
```{r}
filter(movies, distributor == "Universal")
```
is the same as
```{r}
movies %>% filter(distributor == "Universal")
```
This means we can use pipes to join data verbs to make a data sentence.
```{r}
movies %>%
filter(distributor == "Universal") %>%
select(movie, mpaa_rating)
```
**Exercises:** Apply the `dplyr` package and your new data wrangling skills to the movies dataset to
1. Create a subset of the dataset consisting of only movies distributed by `Walt Disney`
2. As above and only including variables `movie`, `worldwide_gross`, and `production_budget`
3. As above but with data sorted alphabetically by `movie`
4. As above but with an additional column return given by `worldwide_gross/production_budget`
We're now going to moving onto another dataset called "gapminder", a dataset which is used to "identif(y) systematic misconceptions about important global trends and proportions and uses reliable data to develop easy to understand teaching materials to rid people of their misconceptions."
Often, data will not be available in a single spreadsheet or csv, and we will need to wrangle several datasets together into a single dataform which captures all of the necessary informaiton for us to addres our research question. Consider, for example, the gapminder datasets, which have multiple dataframes. Review the `README` for the data at `data/gapminder/readme.md`.
Lets work with two datasets for now: 1) has which continent each country belongs to, and 2) has the life expectancy for each country. First, read the data in.
```{r, exercise}
continents <- read_csv("data/gapminder/continents.csv")
life_expectancy <- read_csv("data/gapminder/life_expectancy_years.csv")
```
Lets have a look at each of these datasets.
```{r}
continents %>% View()
life_expectancy %>% View()
```
Life expectancy is a bit unusual because the columns are years, and the rows are countries. This dataset could be more useful if we could rearrange it so that the variable of interest (i.e. life expectancy) was a column, and the countries and years were rows. This is because we usually want observations (i.e. years within a country) of a given variable to be arranged as rows.
```{r}
life_expectancy %>%
pivot_longer(-country, names_to = "year", values_to = "Life_expectancy") -> life_expectancy_long
```
This code achieves a lot with just one command. Firstly, it collapses all values from all columns (except country, denoted by the `-country`) into a single column. It also collapses all column names into a single column, and gives them each the name provided from `values_to`, and `names_to`, respectively. `Pivot_longer` will come up frequently in data wrangling!
Suppose we actually needed the data in the original format. We could convert a long-format dataset into a wide dataset using the counterpart to `pivot_longer`, `pivot_wider'. `pivot_wider` uses similar syntax to `pivot_longer`:
```{r}
life_expectancy_long %>%
pivot_wider(values_from = Life_expectancy, names_from = year)
```
In this case, we define where we would like to get the values to fill in the table with (i.e. life expectancy) and what the columns should be (year).
Let's work with life_expectancy_long for now.
What if then wanted to know whether life_expectancy varies among continents. For this, we would need to combine these datasetes together. To do this, we can use a funciton called `left_join`. `left_join` automatically detects common columns between datasets, and uses these as a key to combine them together, as below.
```{r}
life_expectancy_long %>%
left_join(continents) -> life_expectancy_continent
```
As we can see, `left_join` automatically detects that both dataframes have a column called "country" and uses this as a key to combine these dataframes.
Often, we want to extract summary statistics from data, based on some variable. For example, how does mean life_expectancy varies among continents in a given year? Let's check it out for 2018. To do this, we will utilise a pair of functions: 1) `group_by` and 2) `summarise`. `group_by` defines a grouping variable of interest, in this case, `continents`. `summarise` applies a function to all observations of a defined variable within each group. In this case, we are interested in knowing the mean. First, though, we need to also drop NAs from the dataset (e.g. Andorra does not have life expectancy values)
```{r}
life_expectancy_continent %>%
filter(year == 2018) %>%
drop_na() %>%
group_by(continent) %>%
summarise(life_expectancy = mean(Life_expectancy)) -> life_expectancy_continent_summarised
```
Let's have a look at the results.
```{r}
life_expectancy_continent_summarised %>%
View()
```
Seems like Oceania has the highest life expectancy, followed by Europe, Americas, Asia, and then Africa.
**Exercises:** Apply the `dplyr` package and your new data wrangling skills to the gapminder dataset to
1. Produce a data frame that has the African values for lifeExp, country and year, but not for other Continents.
2. What was lowest life expectancy recorded? The highest?
3. Use summarise() to find the global average life expectancy in each year.
Advanced challenge:
4. Calculate the income per person for each country and each year as a fraction of the global maximum income per person value in that year