# R Basics {#sec-rbasics}
This chapter provides some minimal R basics that may make it easier
to read this book. A more comprehensive treatment of R basics is given in
@advr, chapter 2.
## Pipes {#sec-pipes}
\index{pipes}
The `|>` (pipe) symbols should be read as _then_: we read
```{r eval=FALSE}
a |> b() |> c() |> d(n = 10)
```
as _with `a` do `b`, then `c`, then `d` with `n` being 10_, and that is just alternative syntax for
```{r eval=FALSE}
d(c(b(a)), n = 10)
```
or
```{r eval=FALSE}
tmp1 <- b(a)
tmp2 <- c(tmp1)
d(tmp2, n = 10)
```
To many, the pipe-form is easier to read because execution order
follows reading order, from left to right. Like nested function
calls, it avoids the need to choose names for intermediate results.
As with nested function calls, it is hard to debug intermediate
results that diverge from our expectations. Note that the
intermediate results do exist in memory, so neither form saves memory
allocation. The native pipe `|>`, which appeared in R 4.1.0 and is used
in this book, can be safely substituted by the `%>%` pipe of package
**magrittr**.
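As a small runnable illustration of the equivalence described above, the following sketch replaces the placeholder functions `b`, `c`, and `d` by the base R functions `sqrt`, `sum`, and `round`; the variable names are chosen for this example only, and R 4.1.0 or later is assumed.

```{r}
vals <- c(4, 9, 16)
# pipe form: read from left to right
piped <- vals |> sqrt() |> sum() |> round(digits = 1)
# nested form: read from the inside out
nested <- round(sum(sqrt(vals)), digits = 1)
identical(piped, nested)
```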
## Data structures {#sec-datastructures}
\index{data structures}
As pointed out by @extending, _everything that exists in R is an
object_. This includes objects that make things happen, such as
language objects or functions, but also the more basic "things",
such as data objects. Some basic R data structures will now be
discussed.
### Homogeneous vectors
\index{vectors}
Data objects contain data, and possibly metadata. Data is always
in the form of a vector, which can have different types. We can
find the type by `typeof`, and vector length by `length`. Vectors
are created by `c`, which combines individual elements:
```{r}
typeof(1:10)
length(1:10)
typeof(1.0)
length(1.0)
typeof(c("foo", "bar"))
length(c("foo", "bar"))
typeof(c(TRUE, FALSE))
```
Vectors of this kind can only have a single type.
Note that vectors can have zero length:
```{r}
i <- integer(0)
typeof(i)
i
length(i)
```
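Because vectors of this kind hold a single type, `c` coerces mixed input to the most general type among its arguments. A minimal sketch using base R only:

```{r}
typeof(c(TRUE, 2L))     # logical combined with integer
typeof(c(2L, 2.5))      # integer combined with double
typeof(c(2.5, "foo"))   # double combined with character
```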
\newpage
We can retrieve (or in assignments: replace) elements in a vector
using `[` or `[[`:
```{r}
a <- c(1,2,3)
a[2]
a[[2]]
a[2:3]
a[2:3] <- c(5,6)
a
a[[3]] <- 10
a
```
where the difference is that `[` can operate on an index _range_
(or multiple indexes), and `[[` operates on a single vector value.
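The difference can be made visible with a named vector. The sketch below uses a hypothetical vector `v` and wraps the failing call in `tryCatch` so that the chunk still runs:

```{r}
v <- c(x = 10, y = 20, z = 30)
v[c("x", "z")]   # several elements; names are kept
v[["y"]]         # exactly one element; the name is dropped
# asking `[[` for two elements of an atomic vector is an error
multi <- tryCatch(v[[c(1, 3)]], error = function(e) "error")
multi
```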
### Heterogeneous vectors: `list`
\index{list}
An additional vector type is the `list`, which can combine any types in
its elements:
```{r}
l <- list(3, TRUE, "foo")
typeof(l)
length(l)
```
For lists, there is a further distinction between `[` and `[[`: the single
`[` always returns a list, and `[[` returns the _contents_ of a list element:
```{r}
l[1]
l[[1]]
```
For replacement, one can use `[` when providing a list, and `[[` when providing
a new value:
```{r}
l[1:2] <- list(4, FALSE)
l
l[[3]] <- "bar"
l
```
In case list elements are _named_, as in
```{r}
l <- list(first = 3, second = TRUE, third = "foo")
l
```
we can use names as in `l[["second"]]` and this can be
abbreviated to
```{r}
l$second
l$second <- FALSE
l
```
This is convenient, but it also requires name look-up in the names
attribute (see below).
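One consequence of this name look-up is worth knowing: on lists, `$` matches names partially, whereas `[[` matches them exactly unless asked otherwise. A minimal sketch, with a hypothetical list `cfg`:

```{r}
cfg <- list(first = 3, second = TRUE)
cfg$sec                        # partial matching finds `second`
cfg[["sec"]]                   # exact matching: no such element
cfg[["sec", exact = FALSE]]    # partial matching on request
```

When a silent partial match would be a bug, `[[` with a full name is the safer choice.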
### NULL and removing list elements
\index{NULL}
`NULL` is the null value in R; it is special in the sense that it doesn't work
in simple comparisons:
```{r}
3 == NULL # not FALSE!
NULL == NULL # not even TRUE!
```
but has to be treated specially, using `is.null`:
```{r}
is.null(NULL)
```
When we want to remove one or more list elements, we can do so by creating
a new list that does not contain the elements that needed removal, as in
```{r}
m <- l[c(1,3)] # remove second, implicitly
m
```
but we can also assign `NULL` to the element we want to eliminate:
```{r}
l$second <- NULL
l
```
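Since assigning `NULL` removes an element, keeping an element whose value is `NULL` needs a different form: assigning `list(NULL)` through single `[`. The sketch below uses a hypothetical list `k`:

```{r}
k <- list(first = 3, second = TRUE, third = "foo")
k["second"] <- list(NULL)    # keep the element, with value NULL
n_kept <- length(k)
k[["second"]] <- NULL        # remove the element
n_removed <- length(k)
c(n_kept, n_removed)
```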
### Attributes
\index{attributes!of R objects}
We can glue arbitrary metadata objects to data objects, as in
```{r}
a <- 1:3
attr(a, "some_meta_data") = "foo"
a
```
and this can be retrieved, or replaced by
```{r}
attr(a, "some_meta_data")
attr(a, "some_meta_data") <- "bar"
attr(a, "some_meta_data")
```
In essence, the attributes of an object form a named list, and we can
get or set the complete list by
```{r}
attributes(a)
attributes(a) = list(some_meta_data = "foo")
attributes(a)
```
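Arbitrary attributes such as these do not necessarily survive operations on the object. As a sketch, with a hypothetical attribute `unit`: arithmetic with a scalar keeps it, while subsetting drops it.

```{r}
len <- 1:3
attr(len, "unit") <- "m"
attributes(len + 1L)    # arithmetic with a scalar keeps the attribute
attributes(len[1:2])    # subsetting drops it
```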
A number of attributes are treated specially by R, see `?attributes` for full details.
Some of the special attributes will now be explained.
#### Object class and `class` attribute
\index{class}
\index[function]{class}
Every object in R "has a class", meaning that `class(obj)` returns
a character vector with the class of `obj`. Some objects have
an _implicit_ class, such as basic vectors
```{r}
class(1:3)
class(c(TRUE, FALSE))
class(c("TRUE", "FALSE"))
```
but we can also set the class explicitly, either by using `attr` or by
using `class` in the left-hand side of an expression:
```{r}
a <- 1:3
class(a) <- "foo"
a
class(a)
attributes(a)
```
in which case the newly set class overrides the earlier implicit class. This way,
we can add methods for class `foo` by appending the class name to the name of the generic function (here `print`):
```{r}
print.foo <- function(x, ...) {
  print(paste("an object of class foo with length", length(x)))
}
a # when not assigned, a is printed:
```
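The same mechanism works for generics we define ourselves, and for objects with more than one class. The sketch below assumes a hypothetical generic `describe` and hypothetical classes `circle` and `shape`; none of these come from R or from a package.

```{r}
describe <- function(x, ...) UseMethod("describe")
describe.default <- function(x, ...) "no specific method"
describe.shape <- function(x, ...) {
  paste("a shape with", length(x), "values")
}
obj <- c(1, 2, 3)
class(obj) <- c("circle", "shape")  # there is no describe.circle
describe(obj)           # dispatch continues to describe.shape
describe(1:3)           # no matching class: describe.default
inherits(obj, "shape")
```

Dispatch walks the class vector from left to right and uses the first method it finds, falling back on the default method.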
Providing such methods is generally intended to create more usable
software, but at the same time they may make the objects more opaque. It is
sometimes useful to see what an object "is made of" by printing it after the
class attribute is removed, as in
```{r}
unclass(a)
```
As a more elaborate example, consider the case where a polygon is made using
package **sf**:
```{r}
library(sf) |> suppressPackageStartupMessages()
p <- st_polygon(list(rbind(c(0,0), c(1,0), c(1,1), c(0,0))))
p
```
which prints the well-known-text form; to understand what the data structure is
like, we can use
```{r}
unclass(p)
```
#### The `dim` attribute
\index{dim}
\index[function]{dim}
The `dim` attribute sets the matrix or array dimensions:
```{r}
a <- 1:8
class(a)
attr(a, "dim") <- c(2,4) # or: dim(a) = c(2,4)
class(a)
a
attr(a, "dim") <- c(2,2,2) # or: dim(a) = c(2,2,2)
class(a)
a
```
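The printed matrix shows that values fill the first dimension fastest (column by column), so an element can be reached either by its row and column or by its position in the underlying vector. A minimal sketch, with hypothetical object names:

```{r}
mtx <- 1:8
dim(mtx) <- c(2, 4)
by_rowcol <- mtx[2, 3]               # row 2, column 3
by_position <- mtx[(3 - 1) * 2 + 2]  # same element, counted column-wise
c(by_rowcol, by_position)
dim(mtx) <- NULL                     # removing `dim` gives a plain vector
is.matrix(mtx)
```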
#### The `names` attributes
Named vectors carry their names in a `names` attribute. We saw examples
for lists above; an example for a numeric vector is:
```{r}
a <- c(first = 3, second = 4, last = 5)
a["second"]
attributes(a)
```
Other name attributes include `dimnames` for `matrix` or `array`,
which names not only the dimensions but also the values (labels)
along each of the dimensions:
```{r}
a <- matrix(1:4, 2, 2)
dimnames(a) <- list(rows = c("row1", "row2"),
                    cols = c("col1", "col2"))
a
attributes(a)
```
\newpage
`data.frame` objects have rows and columns, and both have names:
```{r}
df <- data.frame(a = 1:3, b = c(TRUE, FALSE, TRUE))
attributes(df)
```
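The attributes hint at what a `data.frame` is made of: a list with one element per column, plus `names`, `row.names`, and `class`. A short sketch with a hypothetical table `tab`:

```{r}
tab <- data.frame(id = 1:3, ok = c(TRUE, FALSE, TRUE))
typeof(tab)          # the underlying type
length(tab)          # the number of columns
tab[["ok"]]          # a column is a list element
names(unclass(tab))  # the column names of the underlying list
```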
### Using `structure`
\index[function]{structure}
When programming, the pattern of adding or modifying attributes before returning
an object is extremely common, an example being:
```{r, eval=FALSE}
f <- function(x) {
  a <- create_obj(x) # call some other function
  attributes(a) <- list(class = "foo", meta = 33)
  a
}
```
The last two statements can be contracted into
```{r, eval=FALSE}
f <- function(x) {
  a <- create_obj(x) # call some other function
  structure(a, class = "foo", meta = 33)
}
```
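where function `structure` adds, replaces, or (in case of value `NULL`) removes
attributes from the object in its first argument. Note that assigning to
`attributes(a)` replaces the complete attribute list, whereas `structure`
leaves attributes that are not mentioned untouched.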
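As a runnable sketch of adding, keeping, and removing attributes with `structure`, the example below uses the hypothetical attribute names `note` and `unit`:

```{r}
s <- structure(1:4, dim = c(2, 2), note = "draft")
names(attributes(s))
# drop `note`, add `unit`; `dim` is not mentioned and stays
s <- structure(s, note = NULL, unit = "m")
names(attributes(s))
```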
## Dissecting a `MULTIPOLYGON` {#sec-dissecting}
We can use the above examples to dissect an `sf` object with
`MULTIPOLYGON`s into pieces. Suppose we use the `nc` dataset,
```{r}
system.file("gpkg/nc.gpkg", package = "sf") |>
    read_sf() -> nc
```
we can see from the attributes of `nc`,
```{r}
attributes(nc)
```
that the geometry column is named `geom`. When we take out this column,
```{r}
nc$geom
```
we see an object that has the following attributes
```{r}
attributes(nc$geom)
```
When we take the _contents_ of the fourth list element, we obtain
```{r}
nc$geom[[4]] |> format(width = 60, digits = 5)
```
which is a (classed) list,
```{r}
typeof(nc$geom[[4]])
```
with attributes
```{r}
attributes(nc$geom[[4]])
```
and length
```{r}
length(nc$geom[[4]])
```
The length indicates the number of outer rings: a multi-polygon
can consist of more than one polygon. We see that most counties
only have a single polygon:
```{r}
lengths(nc$geom)
```
A multi-polygon is a list with polygons,
```{r}
typeof(nc$geom[[4]])
```
and the _first_ polygon of the fourth multi-polygon is again a list,
because polygons have an outer ring _possibly_ followed by multiple inner
rings (holes):
```{r}
typeof(nc$geom[[4]][[1]])
```
we see that it contains only one ring, the exterior ring:
```{r}
length(nc$geom[[4]][[1]])
```
and we can print the type, the dimensions, and the first set of coordinates by
```{r}
typeof(nc$geom[[4]][[1]][[1]])
dim(nc$geom[[4]][[1]][[1]])
head(nc$geom[[4]][[1]][[1]])
```
and we can now for instance change the latitude of the third coordinate by
```{r}
nc$geom[[4]][[1]][[1]][3,2] <- 36.5
```
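The nesting explored above (a list of polygons, each a list of rings, each ring a matrix of coordinates) can be mimicked with base R alone. The sketch below is not an **sf** object: it carries no class attributes, and all coordinates and object names are made up for illustration.

```{r}
# each ring is a matrix with one row per point; the last point repeats the first
outer_ring <- rbind(c(0, 0), c(4, 0), c(4, 4), c(0, 4), c(0, 0))
hole <- rbind(c(1, 1), c(2, 1), c(2, 2), c(1, 2), c(1, 1))
islet <- rbind(c(5, 5), c(6, 5), c(6, 6), c(5, 5))
# multi-polygon: a list of polygons; polygon: a list of rings
mp_like <- list(list(outer_ring, hole), list(islet))
length(mp_like)                  # number of polygons
lengths(mp_like)                 # number of rings in each polygon
dim(mp_like[[1]][[2]])           # the hole: points by coordinates
mp_like[[1]][[2]][3, 2] <- 2.5   # change the second coordinate of its third point
mp_like[[1]][[2]][3, ]
```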