55-saeb.Rmd

# National Plan and Provider Enumeration System (SAEB) {-}

[![Build Status](https://travis-ci.org/asdfree/saeb.svg?branch=master)](https://travis-ci.org/asdfree/saeb) [![Build status](https://ci.appveyor.com/api/projects/status/github/asdfree/saeb?svg=TRUE)](https://ci.appveyor.com/project/ajdamico/saeb)

The National Plan and Provider Enumeration System (NPPES) contains information about every medical provider, insurance plan, and clearinghouse actively operating in the United States healthcare industry.

* A single large table with one row per enumerated health care provider.

* A census of individuals and organizations who bill for medical services in the United States.

* Updated monthly with new providers.

* Maintained by the United States [Centers for Medicare & Medicaid Services (CMS)](http://www.cms.gov/)

## Simplified Download and Importation {-}

The R `lodown` package easily downloads and imports all available SAEB microdata by simply specifying `"saeb"` with an `output_dir =` parameter in the `lodown()` function. Depending on your internet connection and computer processing speed, you might prefer to run this step overnight.

```{r eval = FALSE }
library(lodown)
lodown( "saeb" , output_dir = file.path( path.expand( "~" ) , "SAEB" ) )
```

## Analysis Examples with base R \ {-}

Load a data frame:

```{r eval = FALSE }
column_names <-
	names( 
		read.csv( 
			file.path( path.expand( "~" ) , "SAEB" , "2015" , "escolas.csv" ) , 
			nrow = 1 )[ FALSE , , ] 
	)

column_names <- gsub( "\\." , "_" , tolower( column_names ) )

column_types <-
	ifelse( 
		SAScii::parse.SAScii(
			file.path( path.expand( "~" ) , "SAEB" , "2015" , "import.sas" ) 
		) , 
		'n' , 'c' 
	)

columns_to_import <-
	c( "entity_type_code" , "provider_gender_code" , "provider_enumeration_date" ,
	"is_sole_proprietor" , "provider_business_practice_location_address_state_name" )

stopifnot( all( columns_to_import %in% column_names ) )

saeb_df <- 
	data.frame( 
		readr::read_csv( 
			file.path( path.expand( "~" ) , "SAEB" , 
				"escolas.csv" ) , 
			col_names = columns_to_import , 
			col_types = 
				paste0( 
					ifelse( column_names %in% columns_to_import , column_types , '_' ) , 
					collapse = "" 
				) ,
			skip = 1
		) 
	)
```

```{r eval = FALSE }

```

### Variable Recoding {-}

Add new columns to the data set:
```{r eval = FALSE }
dbSendQuery( db , "ALTER TABLE ADD COLUMN individual INTEGER" )

dbSendQuery( db , 
	"UPDATE 
	SET individual = 
		CASE WHEN entity_type_code = 1 THEN 1 ELSE 0 END" 
)

dbSendQuery( db , "ALTER TABLE ADD COLUMN provider_enumeration_year INTEGER" )

dbSendQuery( db , 
	"UPDATE 
	SET provider_enumeration_year = 
		CAST( SUBSTRING( provider_enumeration_date , 7 , 10 ) AS INTEGER )" 
)
```

### Unweighted Counts {-}

Count the unweighted number of records in the table, overall and by groups:
```{r eval = FALSE , results = "hide" }
nrow( saeb_df )

table( saeb_df[ , "provider_gender_code" ] , useNA = "always" )
```

### Descriptive Statistics {-}

Calculate the mean (average) of a linear variable, overall and by groups:
```{r eval = FALSE , results = "hide" }
mean( saeb_df[ , "provider_enumeration_year" ] )

tapply(
	saeb_df[ , "provider_enumeration_year" ] ,
	saeb_df[ , "provider_gender_code" ] ,
	mean 
)
```

Calculate the distribution of a categorical variable, overall and by groups:
```{r eval = FALSE , results = "hide" }
prop.table( table( saeb_df[ , "is_sole_proprietor" ] ) )

prop.table(
	table( saeb_df[ , c( "is_sole_proprietor" , "provider_gender_code" ) ] ) ,
	margin = 2
)
```

Calculate the sum of a linear variable, overall and by groups:
```{r eval = FALSE , results = "hide" }
sum( saeb_df[ , "provider_enumeration_year" ] )

tapply(
	saeb_df[ , "provider_enumeration_year" ] ,
	saeb_df[ , "provider_gender_code" ] ,
	sum 
)
```

Calculate the median (50th percentile) of a linear variable, overall and by groups:
```{r eval = FALSE , results = "hide" }
quantile( saeb_df[ , "provider_enumeration_year" ] , 0.5 )

tapply(
	saeb_df[ , "provider_enumeration_year" ] ,
	saeb_df[ , "provider_gender_code" ] ,
	quantile ,
	0.5 
)
```

### Subsetting {-}

Limit your `data.frame` to California:
```{r eval = FALSE , results = "hide" }
sub_saeb_df <- subset( saeb_df , provider_business_practice_location_address_state_name = 'CA' )
```
Calculate the mean (average) of this subset:
```{r eval = FALSE , results = "hide" }
mean( sub_saeb_df[ , "provider_enumeration_year" ] )
```

### Measures of Uncertainty {-}

Calculate the variance, overall and by groups:
```{r eval = FALSE , results = "hide" }
var( saeb_df[ , "provider_enumeration_year" ] )

tapply(
	saeb_df[ , "provider_enumeration_year" ] ,
	saeb_df[ , "provider_gender_code" ] ,
	var 
)
```

### Regression Models and Tests of Association {-}

Perform a t-test:
```{r eval = FALSE , results = "hide" }
t.test( provider_enumeration_year ~ individual , saeb_df )
```

Perform a chi-squared test of association:
```{r eval = FALSE , results = "hide" }
this_table <- table( saeb_df[ , c( "individual" , "is_sole_proprietor" ) ] )

chisq.test( this_table )
```

Perform a generalized linear model:
```{r eval = FALSE , results = "hide" }
glm_result <- 
	glm( 
		provider_enumeration_year ~ individual + is_sole_proprietor , 
		data = saeb_df
	)

summary( glm_result )
```

## Analysis Examples with `dplyr` \ {-}

The R `dplyr` library offers an alternative grammar of data manipulation to base R and SQL syntax. [dplyr](https://github.com/tidyverse/dplyr/) offers many verbs, such as `summarize`, `group_by`, and `mutate`, the convenience of pipe-able functions, and the `tidyverse` style of non-standard evaluation. [This vignette](https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html) details the available features. As a starting point for SAEB users, this code replicates previously-presented examples:

```{r eval = FALSE , results = "hide" }
library(dplyr)
saeb_tbl <- tbl_df( saeb_df )
```
Calculate the mean (average) of a linear variable, overall and by groups:
```{r eval = FALSE , results = "hide" }
saeb_tbl %>%
	summarize( mean = mean( provider_enumeration_year ) )

saeb_tbl %>%
	group_by( provider_gender_code ) %>%
	summarize( mean = mean( provider_enumeration_year ) )
```

---

## Replication Example {-}

```{r eval = FALSE , results = "hide" }
dbGetQuery( db , "SELECT COUNT(*) FROM " )
```