Expanded subsection on data frames; edited for clarity
abner-hb committed Apr 8, 2024
1 parent 72c762b commit 4b42bbf
Showing 3 changed files with 202 additions and 94 deletions.
114 changes: 78 additions & 36 deletions 03_data_in_r.qmd
@@ -4,60 +4,59 @@

## Data types

Data types are classifications of data. These classifications help **R** conform to our intuitive expectations. For example, we usually expect to be able to multiply numbers by each other, but not words. There are six types of data in **R**: doubles, integers, logicals, characters, complex, and raw.
Data types are classifications of data. These classifications help **R** conform to our intuition. For example, multiplying numbers by each other feels right, but multiplying words by each other does not. There are different rules for storing and handling each of these types. And learning these rules will allow us to analyze data later with less effort and fewer mistakes.

+ **Doubles**: regular numbers with a decimal value (which may be zero). The numbers can be positive or negative, large or small. In general, R will save any number that you type in R as a double.
There are six types of data in **R**: doubles, integers, logicals, characters, complex, and raw. **Doubles** are regular numbers with a decimal value (which may be zero). In general, R will save any number that you type in R as a double.
```{r double value}
my_double <- 5
typeof(my_double)
```

+ **Integers**: numbers that can be written without a decimal component. To create an integer you must type a number followed by an `L`:
**Integers** are numbers that have no decimal component. To create an integer you must type a number followed by an `L`:
```{r integer value}
my_integer <- 5L
typeof(my_integer)
```
In data science, we often don't need integers because we can save them as doubles. But **R** stores integers with more precision than doubles. So, integers are still helpful when dealing with complicated operations.
In data science, we often don't need integers because we can save them as doubles. But **R** stores integers exactly, whereas doubles are floating-point approximations that can carry small rounding errors. So, integers are still helpful when a computation needs exact whole numbers.
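
As a small illustration of the difference (a sketch that is not part of the original notes, with made-up object names), compare a double computation that picks up a rounding error with integer arithmetic, which is exact as long as the result stays within the integer range:
```{r}
# Doubles are floating-point approximations, so rounding errors can appear
double_result <- sqrt(2)^2
double_result == 2                    # FALSE: the stored value is not exactly 2
isTRUE(all.equal(double_result, 2))   # TRUE: equal within a small tolerance

# Integer arithmetic is exact
integer_result <- 2L * 3L
typeof(integer_result)                # "integer"
integer_result == 6L                  # TRUE
```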

+ **Logicals**: truth values `TRUE` and `FALSE`, **R**'s form of Boolean data. `NA`, which denotes a missing value, is a special type of logical value. We often have to work with logical values when we compare numbers or objects:
**Logicals** are truth values `TRUE` and `FALSE`. **R** also has a type of logical value called `NA`, which denotes a missing value. We often have to work with logical values when we compare numbers or objects:
```{r logical value}
my_comparison <- -3 < 1
typeof(my_comparison)
```
In most situations, **R** will treat `T` and `F` as abbreviations of `TRUE` and `FALSE`. But `T` and `F` are ordinary variables that can be overwritten, whereas `TRUE` and `FALSE` are reserved words. So I suggest you always write the full words.
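
A minimal sketch of one way the abbreviations can go wrong (this example is an addition, not taken from the original notes): `T` is a regular variable name, so any code can assign a new value to it.
```{r}
T <- 0        # Allowed: `T` is just a variable name
isTRUE(T)     # FALSE, because `T` now holds the number 0
rm(T)         # Delete our variable so that `T` means TRUE again
isTRUE(T)     # TRUE
```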

+ **Characters**: text, or symbols we want to handle as text. You can create a character vector by typing a character or *string* of characters surrounded by quotes:
**Characters** are text, like "hello", "Elvis", or "Somewhere in La Mancha"; or symbols we want to handle as text, like "size 45", or "mail/u". You can create a character vector by typing a character or *string* of characters surrounded by quotes:
```{r character value}
my_character <- "Somewhere in La Mancha"
typeof(my_character)
```
*Anything* surrounded by quotes in R will be treated as a character string---regardless of what is between the quotes.

It is easy to confuse **R** objects with character strings because both appear as pieces of text in R code. For example, `x` is the name of an R object named "x" that contains data; but `"x"` is a character string that contains the character "x", i.e., it is itself a piece of data. We can differentiate strings from real numbers because strings always come surrounded by quotation marks. Also, in **R**Studio strings have different colors from other data types.

```{r character vs double}
typeof("9")
typeof(9)
```

If we forget to use the quotation marks when writing a name, **R** will look for an object that likely doesn't exist, so we will most likely get an error.
It is easy to confuse **R** objects with character strings because both appear as pieces of text in R code. For example, `the_thing` is the name of an R object named "the_thing" that contains data; but `"the_thing"` is a character string that contains the text "the_thing", i.e., it is itself a piece of data. If we forget to use the quotation marks when writing a string, **R** will look for an object that likely doesn't exist, so we will most likely get an error.
```{r}
#| error: true
noquotes
```

We can differentiate strings from real numbers because **R** always shows strings surrounded by quotation marks. And because in **R**Studio, strings have different colors from other data types.

```{r character vs double}
typeof("9")
typeof(9)
```

A special type of character string is a factor. Factors are **R**'s way of storing categorical information, like color or level of agreement. They can only have certain values (e.g., "red" or "green"), and these values may have their own particular order (e.g., "agree", "neutral", "disagree").
Closely related to character strings are *factors*. Factors are **R**'s way of storing categorical information, like color or level of agreement. They look like character strings when printed, but **R** stores them internally as integers with text labels called *levels*. They can only have certain values (e.g., "red" or "green"), and these values may have their own order (e.g., "agree", "neutral", "disagree").
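
The following sketch shows one way to build an ordered factor with base R's `factor()` function. The object name `agreement` and its values are made up for illustration. Although factors print like text, `typeof()` reports that **R** stores them as integer codes that point to the levels.
```{r}
agreement <- factor(
  c("agree", "neutral", "agree", "disagree"),
  levels = c("disagree", "neutral", "agree"), # Allowed values, lowest first
  ordered = TRUE                              # The levels have an order
)
agreement
levels(agreement)      # The allowed values
typeof(agreement)      # "integer"
as.integer(agreement)  # The integer code behind each value
```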

+ **Complex and Raw types**. **R** can also handle imaginary numbers (called "complex") and raw bytes of data (called "raw"). It is unlikely that you will ever need to use these data types, so I will not explain them in these notes.
**R** can also handle imaginary numbers (called **complex**) and raw bytes of data (called **raw**). It is unlikely that you will ever need to use these data types, so I will not explain them in these notes.

## Data structures

Data structures are ways of organizing data. They make it easier for us to manipulate and operate with data. Different data structures have different advantages and limitations.

### Atomic vectors

An atomic vector stores its values as a one-dimensional group. All the elements of an atomic vector must be of the same type of data, with one exception: any vector can include `NA` as a value regardless of the type of the other values. This vector is called "atomic" because we can think of it as the most basic type of data structure.
Atomic vectors store values as one-dimensional groups. All the elements of an atomic vector must be of the same type of data, with one exception: any vector can include `NA` as a value regardless of the type of the other values. These vectors are called "atomic" because we can think of them as the most basic type of data structure.

To create an atomic vector, we can group values using the combine function `c()`:

@@ -78,23 +77,23 @@ length(c())

Adding different data types to the same atomic vector does not produce an error. Instead, **R** automatically follows specific rules to *coerce* everything inside the vector to be of the same type. If a character string is present in an atomic vector, **R** will convert all other values to character strings. If a vector only contains logicals and numbers, R will convert the logicals to numbers; every `TRUE` becomes a `1`, and every `FALSE` becomes a `0`.

Following these coercion rules helps preserve information. It is easy, for example, to recognize the original type of `"TRUE"` and `"3.14"`. Or to transform a vector of `1`s and `0`s back to `TRUE`s and `FALSE`s.
Following these coercion rules helps preserve information. It is easy, for example, to recognize the original type of strings `"TRUE"` and `"3.14"`. Or to transform a vector of `0`s and `1`s back to `TRUE`s and `FALSE`s.
:::
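
The coercion rules described in the note can be checked directly. This is a small sketch with made-up object names:
```{r}
mixed_with_text <- c(1, TRUE, "a")
mixed_with_text               # "1" "TRUE" "a"
typeof(mixed_with_text)       # "character"

logicals_and_numbers <- c(TRUE, FALSE, 3)
logicals_and_numbers          # 1 0 3
typeof(logicals_and_numbers)  # "double"

as.logical(c(1, 0, 1))        # Back to TRUE FALSE TRUE
```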

### Matrices

A matrix stores values in a two-dimensional box. To create a matrix, first give `matrix()` an atomic vector to reorganize into a matrix. Then, define the number of rows using the `nrow` argument, or the number of columns using the `ncol` argument. `matrix()` will reshape your vector into a matrix with the specified number of rows (or columns).

```{r create matrix}
scores_vec <- c(50, 90, 65, -10, 115, -23)
scores_vec <- c(-27, 2, 2, 14, -28, 35, 8, 13, 4)
scores_mat <- matrix(data = scores_vec, nrow = 3)
scores_mat
# Equivalently
scores_mat <- matrix(data = scores_vec, ncol = 2)
scores_mat <- matrix(data = scores_vec, ncol = 3)
scores_mat
```

Like atomic vectors, matrices can have any data type, but only one data type:
Like atomic vectors, matrices can have any data type, but only one data type (or `NA`):
```{r}
character_mat <- matrix(
data = c("Mario", "Peach", "Luigi", "Yoshi"),
@@ -103,15 +102,15 @@ character_mat <- matrix(
character_mat
```

By default `matrix()` will fill up the matrix column by column; but you can fill the matrix row by row if you include the argument `byrow = TRUE`:
By default `matrix()` will fill up the matrix column by column. But we can fill the matrix row by row if we include the argument `byrow = TRUE`:

```{r create matrix filling by row}
scores_vec <- c(-27, 2, 2, 14, -28, 35, 8, 13, 4)
scores_mat <- matrix(data = scores_vec, nrow = 3, byrow = TRUE)
scores_mat
```

Notice the expressions with square brackets in the output above (e.g., `[,1]`). They are position indices that signal the "coordinates" of the matrix. Two-dimensional objects like matrices have two indices, one for each dimension. The first number always refers to the row, and the second always refers to the column. So, as with vectors, we can use square bracket notation `[ ]` to extract values from matrices.
Notice the expressions with square brackets in the output above (e.g., `[,1]`). They are positional indices that signal the "coordinates" of the matrix. Two-dimensional objects like matrices have two indices, one for each dimension. The first number always refers to the row, and the second always refers to the column. So, as with vectors, we can use square bracket notation `[ ]` to extract values from matrices.
```{r extract value from matrix}
scores_mat[c(1, 3), 2] # Rows 1 and 3 in column 2
```
@@ -124,7 +123,7 @@ scores_mat[, 1] # Extract entire first column

::: {.callout-note}
## Matrices are fancy vectors
Deep down, **R** thinks of a matrix as a vector folded to look like a rectangle. That means that you can reference an element of a matrix with a single positional index. Check what happens if you run `scores_mat[5]`.
Deep down, **R** thinks of a matrix as a vector folded to look like a rectangle. That means, among other things, that you can reference an element of a matrix with a single positional index. Check what happens if you run `scores_mat[5]`.
:::
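
To see the "folding" in action, this sketch (an addition with a made-up object name) turns a plain vector into a matrix just by giving it a `dim` attribute. The values are laid out column by column, so a single index counts down the first column, then the second, and so on.
```{r}
folded <- 1:6
is.matrix(folded)           # FALSE: still a plain vector
dim(folded) <- c(2, 3)      # Fold it into 2 rows and 3 columns
is.matrix(folded)           # TRUE
folded
folded[4] == folded[2, 2]   # TRUE: both point to the same value
```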

We can define names for the rows and the columns of a matrix using the `rownames()` and `colnames()` functions.
@@ -141,7 +140,7 @@ scores_mat["Andrew", c("Rapture", "Columbia")]

There are several useful functions to do matrix operations. For example:
```{r matrix operations}
#| results: false
#| eval: false
t(scores_mat) # Transpose the matrix
diag(scores_mat) # Extract values in diagonal
scores_mat + scores_mat # Matrix addition
@@ -151,7 +150,7 @@ scores_mat %*% scores_mat # Matrix multiplication

### Arrays

The `array()` function creates an n-dimensional array. Using an n-dimensional array is like stacking groups of data like [matryoshka dolls](https://en.wikipedia.org/wiki/Matryoshka_doll). 1 dimension forms a column of data with multiple values; 2 dimensions form a sheet of paper with several columns of data; 3 dimensions form a book with several sheets; 4 dimensions form a box with several books; 5 dimensions form a box that contains other boxes, and so on.
The `array()` function creates an n-dimensional array. Using an n-dimensional array is like stacking groups of data. 1 dimension forms a column of data with multiple values; 2 dimensions are like a sheet of paper with several columns of data; 3 dimensions are like a book with several sheets; 4 dimensions are like a box with several books; 5 dimensions are like a room that contains boxes, and so on. Note that the layers of an array have consistent sizes. All books have the same number of sheets, and all sheets have the same number of rows and columns.

To use `array()`, we need an atomic vector as the first argument, and a vector of dimension sizes `dim` as the second argument:

@@ -164,20 +163,20 @@ The `dim` argument works from the inside out. The first value is the number of e
Note that the total number of elements in the array is equal to the product of the sizes of all dimensions. If the vector we use to build the array has a different number of elements, **R** will discard the extra values or recycle the existing ones to fill the array. Check it yourself.
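
One possible way to check it, as a sketch with made-up object names: build an array with 2 * 2 * 2 = 8 positions, first from a vector that is too short and then from one that is too long.
```{r}
# Too few values: the 4 values are recycled to fill the 8 positions
recycled_array <- array(1:4, dim = c(2, 2, 2))
recycled_array

# Too many values: only the first 8 are used, the rest are discarded
trimmed_array <- array(1:20, dim = c(2, 2, 2))
trimmed_array
length(trimmed_array)  # 8
```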

::: {.callout-tip}
## Practice your inception
## Applied inception

Try to make an array with 4 dimensions. Following the metaphor from above, try to make a box that contains 3 books, each of which has 4 sheets with 2 columns and 2 rows each. See a quick solution below.

```{r inception solution}
#| code-fold: true
#| results: false
#| eval: false
array(c(1:48), dim = c(2, 2, 4, 3))
```
:::

Vectors, matrices, and arrays need all of their values to be of the same type. This requirement seems rigid, but it allows the computer to store large sets of numbers in a simple and efficient way; and it accelerates computations because **R** knows that it can manipulate all values in the object the same way. Also, vectors make it easy for us to store values that are supposed to measure the same property. It would be hard to understand what a vector represented if it had values like `"salsa"` and `sqrt(77)`.

But sometimes we need to store different types of data in a single place---maybe because all of that data belongs to the same underlying concept. For example, we can describe a dog based on its height, weight, and age (numerical values), and on its color and breed (character strings). **R** has a way of keeping all of these diverse data in a single place.
But sometimes we need to store different types of data in a single place---maybe because all of that data belongs to the same underlying concept. For example, we can describe a dog based on its height, weight, and age (numerical values), and on its color and breed (character strings). **R** can keep all of these in a single place.

### Lists

@@ -188,9 +187,9 @@ all_in_one_list <- list(c(3.1, 10), "El Zorro", list(character_mat))
all_in_one_list
```

The double-bracketed indices tell you which *element* of the list is displayed. The single-bracket indexes tell you which *subelement* of an element is displayed. For example, `3.1` is the first subelement of the first element in the list, and `"El Zorro"` is the first subelement of the second element. This two-system notation helps us recognize which level of the stacking we are in, regardless of what is stacked inside the list.
The double-bracketed indices tell you which *element* of the list is displayed. The single-bracket indices tell you which *subelement* of an element is displayed. For example, `3.1` is the first subelement of the first element in the list, and `"El Zorro"` is the first subelement of the second element. This double notation helps us recognize which level of the stacking we are in, regardless of what is stacked inside the list.

There are two ways to access an element from a list, depending on what we want to do with the output. We can use single bracket notation `[ ]` to get a new list with elements from the original list.
There are two ways to access an element from a list, depending on what we want to do with the output. We can use single bracket notation `[ ]` to get a new list with elements from the original list:
```{r extracting list elements as a new list}
new_list <- all_in_one_list[1]
new_list
@@ -199,7 +198,7 @@ typeof(new_list)

Or we can use double bracket notation `[[ ]]` to get only the contents of an element from the original list (we cannot extract multiple elements this way):
```{r extracting a single element without more lists}
new_item <- all_in_one_list[[2]]
new_item <- all_in_one_list[[1]]
new_item
typeof(new_item)
```
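
The two notations can also be chained. In this sketch (the object `toy_list` is made up so that the example runs on its own), double brackets pull the vector out of the list, and single brackets then pick a subelement of that vector:
```{r}
toy_list <- list(c(3.1, 10), "El Zorro")
class(toy_list[1])     # "list": the element is still wrapped in a list
class(toy_list[[1]])   # "numeric": the vector itself
toy_list[[1]][2]       # 10: second subelement of the first element
```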
@@ -215,21 +214,64 @@ countries_info <- list(

Or we can name them after the list is made using `names()`:
```{r name list after creation}
names(all_in_one_list) <- c("mass", "hero", "game")
names(all_in_one_list) <- c("weight", "hero", "game")
all_in_one_list
```

With a named list, we can also use dollar sign notation `$` to extract elements. This notation produces the same result as using the double bracket notation `[[ ]]`:
```{r}
```{r extract with $ notation}
countries_info$speak_spanish
countries_info[["speak_spanish"]]
```
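
One practical difference between the two notations is worth a short sketch (the list `pet` and the variable `field` are hypothetical): `[[ ]]` accepts a name stored in a variable, whereas `$` always takes the text after it literally.
```{r}
pet <- list(name = "Rocinante", age = 7)
identical(pet$age, pet[["age"]])  # TRUE: same result for a literal name

field <- "name"
pet[[field]]   # "Rocinante": uses the value stored in `field`
pet$field      # NULL: looks for an element literally called "field"
```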

### Data frames

Data frames are the most common storage structure for data analysis. We can think of them as a group of atomic vectors (columns), where different vectors can have different data types. Usually, each row of a data frame represents an individual observation and each column represents a different measurement or variable of that observation.
Data frames are the most common storage structure for data analysis. We can think of them as a group of atomic vectors (columns), where different vectors can have different data types but all vectors must have the same length. Usually, each row of a data frame represents an individual observation and each column represents a different measurement or variable of that observation.

We can create a data frame using the `data.frame()` function. Give `data.frame()` any number of vectors of equal length, separated by commas. Each vector should be set equal to a name that describes the vector. `data.frame()` will turn each vector into a column of the new data frame:
```{r create data frame from scratch}
aliens_df <- data.frame(
name = c("Axanim", "Blob", "Cloomin", "Dlemex"),
planet = c("Kepler-5", "Patzapuan", "Laodic_Prime", "Future_Earth"),
number_of_arms = c(5, NA, 1, 2.5)
)
aliens_df
```

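If we try to use vectors of different lengths, **R** will recycle the values of the shorter vectors to keep the data frame rectangular, but only when the longest length is an exact multiple of each shorter length. Otherwise, `data.frame()` throws an error.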
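
A quick sketch of both cases, with made-up column names. Recycling only works when the lengths divide evenly; `try()` is used here so that the failing call does not stop the script.
```{r}
# Length 2 divides length 4 evenly, so `group` is recycled
recycled_df <- data.frame(id = 1:4, group = c("a", "b"))
recycled_df

# Length 3 does not divide length 4, so data.frame() fails
failed <- try(
  data.frame(id = 1:4, group = c("a", "b", "c")),
  silent = TRUE
)
inherits(failed, "try-error")  # TRUE
```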

We can use the function `dim()` to get the size of each dimension of the data frame, and the function `str()` to get a compact summary:
```{r minimal summary of data frame}
dim(aliens_df)
str(aliens_df)
```

The `str()` function gives us the data frame dimensions and also reminds us that `aliens_df` is an object of class `data.frame`. `str()` also lists all of the variables (columns) contained in the data frame, tells us what type of data each variable contains, and prints out its first few values.

To us, a data frame resembles a matrix, but to **R** it is a list with an attribute "class" set to "data.frame":

```{r type and class of data frame}
typeof(aliens_df)
class(aliens_df)
```
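Because a data frame is a list underneath, list tools work on it column by column. A minimal sketch, using a made-up data frame `toy_df` so that the example runs on its own:
```{r}
toy_df <- data.frame(size = c(1.5, 2), tall = c(FALSE, TRUE))
is.list(toy_df)         # TRUE
length(toy_df)          # 2: one list element per column
names(toy_df)           # The column names
sapply(toy_df, typeof)  # Each column keeps its own data type
```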

As with lists, we can extract values from data frames using single brackets `[ ]`, which produce a new data frame:
```{r extracting values from data frame}
aliens_df[2]
# Equivalently
aliens_df["planet"]
```

Or we can use double brackets `[[ ]]` or dollar sign notation `$`, both of which produce atomic vectors:
```{r}
aliens_df[[1]]
```

Also, as with matrices, we can use a single bracket with two indices (note that this produces an atomic vector when we select a single column):
```{r}
aliens_df[c(1,2), "number_of_arms"]
```
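For comparison, here is a short sketch of dollar sign notation next to two-index extraction, using a made-up data frame `toy_df`. The `drop = FALSE` argument is an optional extra that keeps the result as a data frame instead of simplifying it to a vector.
```{r}
toy_df <- data.frame(id = 1:3, arms = c(5, NA, 1))
toy_df$arms                            # Atomic vector
toy_df[c(1, 2), "arms"]                # Atomic vector
toy_df[c(1, 2), "arms", drop = FALSE]  # Data frame with one column
```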

To us, a data frame resembles a matrix, but to **R** it is more like a list.

## References
