README.Rmd

---
output: github_document
---

<!-- README.md is generated from README.Rmd. Please edit that file -->

```{r, echo = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "tools/README-"
)
```

[![stability-stable](https://img.shields.io/badge/stability-stable-green.svg)](https://github.com/joethorley/stability-badges#stable)
[![R build status](https://github.com/poissonconsulting/datacheckr/workflows/R-CMD-check/badge.svg)](https://github.com/poissonconsulting/datacheckr/actions)
[![Codecov test coverage](https://codecov.io/gh/poissonconsulting/datacheckr/branch/master/graph/badge.svg)](https://codecov.io/gh/poissonconsulting/datacheckr?branch=master)
[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)
[![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/datacheckr)](https://cran.r-project.org/package=datacheckr)
[![CRAN Downloads](http://cranlogs.r-pkg.org/badges/grand-total/datacheckr)](http://www.r-pkg.org/pkg/cranlogs)

# datacheckr

## Introduction

`datacheckr` is an R package to check the classes and values of scalars and vectors and the column names, classes, values, keys and joins in data frames.
It provides an informative error message if a user-defined condition fails to be met otherwise it returns the object (so it can be used in pipes).

## Demonstration

Consider the data frame `my_data`
```{r, message = FALSE}
library(tibble)

my_data <- data_frame(
  Count = c(0L, 3L, 3L, 0L, NA), 
  Longitude = c(0, 0, 90, 90, 180), 
  Latitude = c(0, 90, 90.2, 100, -180),
  Type = factor(c("Good", "Bad", "Bad", "Bad", "Bad"), levels = c("Good", "Bad")),
  Extra = TRUE,
  Comments = c("In Greenwich", "Somewhere else", "I'm lost", "I didn't see any", "Help"))

my_data
```

### Integers

To specify that `my_data` *must* contain a column called `col1` of class integer use the `check_data2` function
```{r, error = TRUE}
library(datacheckr)
check_data2(my_data, values = list(col1 = integer()))
```

### Missing Values

To specify that a column cannot include missing values pass a single non-missing value (of the correct class)
```{r, error = TRUE}
check_data2(my_data, list(Count = 1L))
```
To specify that it can include missing values include an NA in the vector
```{r, error = TRUE}
check_data2(my_data, list(Count = c(1L, NA)))
```
and to specify that it can only include missing values pass an NA (of the correct class)
```{r, error = TRUE}
check_data2(my_data, list(Count = NA_integer_))
```

## Value Ranges

To indicate that the values must fall within a range use two non-missing values
```{r, error = TRUE}
check_data2(my_data, list(Count = c(0L, 2L)))
```

## Specific Values

If particular values are required then specify them as a vector of three or more non-missing values
```{r, error = TRUE}
check_data2(my_data, list(Count = c(1L, 2L, 2L)))
```

The order of the values in an element is unimportant.

### Numeric, Date and POSIXct

Numeric, Date and POSIXct columns have exactly the same behaviour regarding ranges and specific values as illustrated above using integers.

### Logical

With logical values two non-missing values produce the same behaviour as three or more non-missing values.

For example to test for only `FALSE` values use
```{r, error = TRUE}
check_data2(my_data, list(Extra = c(FALSE, FALSE)))
```

### Characters

The following requires that the values of `Comments` match both character elements which are treated as regular expressions
```{r, error = TRUE}
check_data2(my_data, list(Comments = c("e", "o")))
```
with three or more non-missing character elements each value must match at least one of the elements which are treated as regular expressions.
```{r, error = TRUE}
check_data2(my_data, list(Comments = c("e", "o", "o")))
```

Regular expressions are matched using `grepl` with `perl=TRUE`.

### Factors
To specify that `Type` should be a factor that includes `"Bad1"` and `"Good"` among its levels
```{r, error = TRUE}
check_data2(my_data, list(Type = factor(c("Bad1", "Good"))))
```
And to specify the actual factor levels, pass three or more non-missing values
```{r, error = TRUE}
check_data2(my_data, list(Type = factor(c("Bad", "Good", "Good"))))
```

### Column Names

Whereas `check_data2()` ignores unnamed columns and doesn't care about the order, `check_data3()` requires that column names match the names in values.
```{r, error = TRUE}
check_data3(my_data, list(Comments = character()))
```

### Missing Columns

In contrast, `check_data1()` can be used to test that specific columns are missing or that a column satisfies one of multiple conditions.
```{r, error = TRUE}
check_data1(my_data, list(Comments = NULL))
```

```{r, error = TRUE}
check_data1(my_data, list(Comments = integer(),
                          Comments = numeric()))
```

To specify that `my_data` *can* contain a column `col1` that can
be integer or numeric values the call would be
```{r}
check_data1(my_data, list(
  col1 = integer(), 
  col1 = NULL, 
  col1 = numeric()))
```

### Naming Objects

By default, `datacheckr` determines the name of an object based on the call.
This results in uninformative error messages when used in a pipe
```{r, error = TRUE}
library(magrittr)
my_data %<>% check_data2(values = list(col1 = integer()))
```
The argument `data_name` can be used to define the name
```{r, error = TRUE}
library(magrittr)
my_data %<>% check_data2(values = list(col1 = integer()), data_name = "d8r")
```

### Relational Data

Consider the [relational data](http://r4ds.had.co.nz/relational-data.html#nycflights13-relational) in the `nycflights13` package.

#### Keys

The following code uses the `check_data3` function to confirm that airlines has just two columns `carrier` and `name`, in that order, which are both character vectors and that carrier is unique (a key).

```{r}
library(nycflights13)
check_data3(airlines, list(carrier = "", 
                           name = ""),
            key = "carrier")
```

The next code checks that `airports` has the listed columns in that order and that
`faa` is a unique character vector of three 'word characters', `lat` is a number between 0 and 90, `alt` is an integer between -100 and 10,000, and `dst` is a character vector with the possible values A, N or U.

```{r}
check_data3(airports, list(faa = rep("^\\w{3,3}$",2),
                           name = "",
                           lat = c(0, 90),
                           lon = c(-180, 180),
                           alt = as.integer(c(-100, 10^5L)),
                           tz = c(-11, 11),
                           dst = rep("A|N|U", 2),
                           tzone = ""),
            key = "faa")
```

This checks that planes *includes* tailnum, engines and year (as using less strict `check_data2`) and that engines is 1, 2, 3 or 4, that year is an integer between 1956 and 2013 that can include missing values and tailnum (which consists of strings of 5 to 6 letter 'word characters') is the unique key.
```{r}
check_data2(planes, list(tailnum = rep("^\\w{5,6}$",2),
                         engines = 1:4,
                         year = c(1956L, 2013L, NA)),
            key = "tailnum")
```

#### Selecting Columns

Weather has lots of columns. by setting `select = TRUE` in `check_data3` we drop non-named columns and order to match values.
The checks indicate that year is only 2013, and like month is a number but day and hour are integers (as expected)
```{r}
weather %<>% check_data3(list(year = c(2013,2013),
                              month = c(1, 12),
                              day = c(1L, 31L),
                              hour = c(0L, 23L),
                              origin = rep("^\\w{3,3}$",2)),
                 select = TRUE)
weather
```

#### Joins

Checking the referential integrity of the (many-to-one) join between `flights` and `airlines` is easy.

```{r, error = TRUE}
check_join(flights, airlines, join = "carrier")
```

In addition to `tailnum`, `flights` and `planes` have additional column with the same name.
```{r, error = TRUE}
check_join(flights, planes, join = "tailnum")
```
We can deal with this by setting `extra = TRUE` but the data fail referential integrity because we have planes without flights.
```{r, error = TRUE}
check_join(flights, planes, extra = TRUE, join = "tailnum")
```

## Installation

To install the most recent release from CRAN
```
install.packages("datacheckr")
```

To install the development version from GitHub
```
# install.packages("devtools")
devtools::install_github("datacheckr")
```

## Contribution

Please report any [issues](https://github.com/poissonconsulting/datacheckr/issues).

[Pull requests](https://github.com/poissonconsulting/datacheckr/pulls) are always welcome.

## Code of Conduct

Please note that the datacheckr project is released with a [Contributor Code of Conduct](https://contributor-covenant.org/version/2/0/CODE_OF_CONDUCT.html). By contributing to this project, you agree to abide by its terms.