-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathotbirders_2_stats_2018.Rmd
328 lines (260 loc) · 10.6 KB
/
otbirders_2_stats_2018.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
---
title: Wednesday Birders - 2018 stats and historical analysis using R
author: Mark Isken
date: 2019-02-04
category: R, python
tags: R, python, birding, dplyr, ggplot2, ebird
summary: Using the eBird API from Python, we downloaded birding lists submitted by our Wednesday morning birding group. R was then used to produce various plots relating to counts of birds sighted and frequency of list appearance, by species.
output: html_document
---
## The Wednesday Birders
In a [previous post](http://hselab.org/wednesday-ot-birders-using-the-ebird-api-python-and-r-to-analyze-data-for-our-birding-group.html) I described how I used a mix of Python and R to acquire, clean, and analyze [eBird](https://ebird.org/home) data from our birding group's weekly walks. In this post, I'll add plots for 2018 and do a little more analysis.
To summarize the basic approach (described in the first post), we:
* used the eBird API (2.0) with Python to download data from our bird lists into a pandas dataframe and then exported to csv file,
* used R to clean up the data to make sure we were just using our Wednesday Birders lists for the analysis,
* used the R packages dplyr and ggplot2 to summarize and make plots of
species counts by year.
## Data prep
All of the data prep and analysis is done in R. We'll need a few libraries:
```{r libraries, echo=TRUE, warning=FALSE, results='hide', message=FALSE}
library(dplyr)
library(ggplot2)
library(lubridate)
```
Before diving into analysis and plots, a little data prep was needed:
* downloaded data from 2018 and converted the JSON to csv (used appropriately modified Python script detailed in first post)
* read CSV file into an R dataframe
* convert datetime fields to POSIXct
* include only the lists from our Wednesday morning walks
* combined the 2018 data with the data from 2015-1017 and saved as an RDS file
```{r read_obs_rds}
obs_df <- readRDS(file = "./data/observations.rds")
```
There are a bunch of fields in the observation records from eBird. Here's
just a sample focusing on the stuff we need. Each row in the
data frame corresponds to an individual species count on a certain date
in a certain location.
```{r}
obs_df <- obs_df %>%
select(checklistId, obsDt, locName, comName, sciName, howMany, subId)
head(obs_df)
```
## Plots of Species Counts
How many birds of each species have we seen? How frequently are each species
seen?
Let's start with simple bar charts:
* one bar per species, one year per graph,
* bar length is number of birds seen,
* color of bar is related to percentage of lists on which that species seen,
* number at end of bar is percentage of lists on which that species seen.
Here are the plots from 2015-2017:
```{r}
knitr::include_graphics("images/top30_2015.png")
knitr::include_graphics("images/top30_2016.png")
knitr::include_graphics("images/top30_2017.png")
```
## Creating the plot for 2018
```{r create_topobs, echo=FALSE}
# Create num species by list df
numsp_bylist <- obs_df %>%
group_by(year=year(obsDt), obsDt, locName, subId) %>%
summarize(
n = n(),
totbirds = sum(howMany)
) %>%
arrange(year, subId)
# Using numsp_bylist, create num lists by date
numlists_bydt <- numsp_bylist %>%
group_by(obsDt) %>%
summarise(
numlists = n()
) %>%
filter(numlists >= 1) %>%
arrange(obsDt)
# Usings numlists_bydt, create num lists by year
numlists_byyear <- numlists_bydt %>%
group_by(birding_year=year(obsDt)) %>%
summarise(
totlists = sum(numlists)
)
# Now ready to compute species by year
species_byyear <- obs_df %>%
group_by(comName, birding_year=year(obsDt)) %>%
summarize(
num_lists = n(),
tot_birds = sum(howMany)
) %>%
arrange(birding_year, desc(tot_birds))
# Join to numlists_byyear so we can compute pct of lists each species
# appeared in.
species_byyear <- left_join(species_byyear, numlists_byyear, by = 'birding_year')
bird_year <- 2018
top_obs_byyear <- species_byyear %>%
filter(birding_year == bird_year) %>%
mutate(pctlists = num_lists / totlists) %>%
arrange(desc(tot_birds)) %>%
head(30)
```
The plots above are easy to create from a dataframe that looks like this:
```{r top_obs_byyear, echo=FALSE}
as.data.frame(top_obs_byyear)
```
Here's how you can make such a data frame.
```{r create_topobs_show}
# Create num species by list df
numsp_bylist <- obs_df %>%
group_by(year=year(obsDt), obsDt, locName, subId) %>%
summarize(
n = n(),
totbirds = sum(howMany)
) %>%
arrange(year, subId)
# Using numsp_bylist, create num lists by date
numlists_bydt <- numsp_bylist %>%
group_by(obsDt) %>%
summarise(
numlists = n()
) %>%
filter(numlists >= 1) %>%
arrange(obsDt)
# Usings numlists_bydt, create num lists by year
numlists_byyear <- numlists_bydt %>%
group_by(birding_year=year(obsDt)) %>%
summarise(
totlists = sum(numlists)
)
# Now ready to compute species by year
species_byyear <- obs_df %>%
group_by(comName, birding_year=year(obsDt)) %>%
summarize(
num_lists = n(),
tot_birds = sum(howMany)
) %>%
arrange(birding_year, desc(tot_birds))
# Join to numlists_byyear so we can compute pct of lists each species
# appeared in.
species_byyear <- left_join(species_byyear, numlists_byyear, by = 'birding_year')
bird_year <- 2018
top_obs_byyear <- species_byyear %>%
filter(birding_year == bird_year) %>%
mutate(pctlists = num_lists / totlists) %>%
arrange(desc(tot_birds)) %>%
head(30)
```
Make the plot.
```{r plot2018}
ggplot(top_obs_byyear) +
geom_bar(aes(x=reorder(comName, tot_birds),
y=tot_birds, fill=pctlists), stat = "identity") +
scale_fill_gradient(low='#05D9F6', high='#5011D1') +
labs(x="",
y="Total number of birds sighted",
fill="Pct of lists",
title = paste0("Top 30 most sighted birds by Wednesday Birders in ", bird_year),
subtitle = "(number at right of bar is % of lists on which the species appeared)") +
coord_flip() +
geom_text(data=top_obs_byyear,
aes(x=reorder(comName,tot_birds),
y=tot_birds,
label=format(pctlists, digits = 1),
hjust=0
), size=3) + ylim(0,500)
```
## Patterns and changes over time
Now that we have four full years of eBird sightings data from the
four parks we visit each month, let's see how things look over time (and location).
Here's an overall time series plot of number of species appearing on
each list.
```{r}
numsp_bylist %>%
ggplot() + geom_point(aes(x = as.Date(obsDt), y = n)) +
ggtitle("Number of species sighted for each list", subtitle = "2015-2018") +
scale_x_date(date_breaks = "3 months",
date_labels = "%m-%y") +
xlab("List date (month-year)") +
ylab("Number of species")
```
Same plot but faceted by park.
```{r fig.height=8}
numsp_bylist %>%
ggplot() + geom_point(aes(x = as.Date(obsDt), y = n)) +
facet_grid(locName ~ .) +
ggtitle("Number of species sighted for each list", subtitle = "2015-2018") +
scale_x_date(date_breaks = "3 months",
date_labels = "%m-%y") +
xlab("List date (month-year)") +
ylab("Number of species")
```
Let's look at distribution of list size during the peak period of
April-October.
```{r}
numsp_bylist %>%
filter(month(obsDt) %in% 4:10) %>%
ggplot() + geom_histogram(aes(x = n, y = ..density..), bins = 20) +
ggtitle("Distribution of list size", subtitle = "April-October") +
xlab("List size (number of species)") +
geom_density(aes(x = n))
```
Use boxplots and group by park.
```{r}
numsp_bylist %>%
filter(month(obsDt) %in% 4:10) %>%
ggplot() + geom_boxplot(aes(x = locName, y = n)) +
ggtitle("Distribution of list size", subtitle = "April-October") +
xlab("Location") + ylab("List size (number of species)")
```
Let's repeat the above but using total number of birds counted per
list instead of number of species per list.
```{r}
numsp_bylist %>%
filter(month(obsDt) %in% 4:10) %>%
ggplot() + geom_boxplot(aes(x = locName, y = totbirds)) +
ggtitle("Distribution of number of total birds per list", subtitle = "April-October") +
xlab("Location") + ylab("List size (number of birds counted)")
```
Let's look at time series plots of number of birds counted for each
of the species in the Top 30 for 2018. We'll need the following
data frame to power the plots.
```{r}
top_species_byyear <- top_obs_byyear %>%
select("comName") %>%
inner_join(species_byyear, by = "comName") %>%
mutate(pctlists = num_lists / totlists)
top_species_byyear
```
Bar height is total number of birds counted and bar color
is related to percentage of lists on which this bird appeared.
```{r fig.height=10, fig.width=10}
ggplot(top_species_byyear) +
geom_bar(aes(x = birding_year, y = tot_birds, fill=pctlists),
stat = "identity") +
scale_fill_gradient(low='#05D9F6', high='#5011D1') +
ggtitle("Number of birds counted by year") +
facet_wrap(~comName, ncol = 5, scales = "free_y") +
labs(x="Year",
y="Total number of birds sighted",
fill="Pct of lists")
```
A few things pop out:
- overall number of [Black-capped Chicadees](https://www.allaboutbirds.org/guide/Black-capped_Chickadee) counted has declined, though they are still sighted very frequently. Maybe we've become somewhat immune to their
constant chatter and just aren't recording as many. "Oh, just another chicadee."
- [Downy Woodpecker](https://www.allaboutbirds.org/guide/Downy_Woodpecker) sightings have declined both in terms of number and frequency as have [White-breasted Nuthatch](https://www.allaboutbirds.org/guide/White-breasted_Nuthatch) sightings.
- [Mourning Doves](https://www.allaboutbirds.org/guide/Mourning_Dove) are up in both number and frequency.
- After 2015, sightings of [Eastern Bluebirds](https://www.allaboutbirds.org/guide/Eastern_Bluebird) have been remarkably constant.
Now swap roles of total birds and frequency of appearing on a list. Bar
height is percentage of lists on which this bird appeared and bar color
is related to total number of birds counted.
```{r fig.height=10, fig.width=10}
ggplot(top_species_byyear) +
geom_bar(aes(x = birding_year, y = pctlists, fill = tot_birds),
stat = "identity") +
scale_fill_gradient(low='#05D9F6', high='#5011D1') +
ggtitle("Number of birds counted by year") +
facet_wrap(~comName, ncol = 5) +
labs(x="Year",
y="Pct of lists",
fill="Total birds counted")
```
## Next steps
Now our group can try to make sense of these and I'm sure we'll have ideas for more analysis. In my next post I'm going to
explore those springtime favorites, the [warblers](https://www.allaboutbirds.org/guide/browse/taxonomy/Parulidae).