coding principles

Ivan Rudik edited this page May 7, 2019 · 2 revisions

We generally want to follow the Gentzkow and Shapiro code structure and data storage protocols. Several basic points are worth re-emphasizing:

  • The entire project, from initial data to compiling the paper PDF, can be run from one script, typically ~/git/project-name/make-paper.r. This R script calls the project's other files, e.g. Stata, R, and LaTeX.
  • Do not manually pre-process data, e.g. manipulate Excel sheets, before importing into R or Stata. All data processing, beginning with the original file, should be automated and, in the final version, called by make-paper.r.
  • Keep code less than 100 characters wide so that it is easy to read.
  • Each dataset has a valid (unique, non-missing) key / observation ID. For example, you might have a dataset of US county characteristics, e.g. square miles and 1969 population, with one row for each county and the key being 1000*state_fips+county_fips.
  • Keep datasets normalized (meaning that they contain only variables at the same logical level as the key) as late in the data preparation process as possible. Once you merge a state-level dataset with a county-level dataset, each state-level variable is recorded many times (once for each county in that state). This takes a lot of space and can also confuse other aspects of data preparation.
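
As a rough illustration of the last two points, here is a minimal R sketch that builds the county key, checks that it is unique and non-missing, and merges in state-level data only at the final step. The data frames, the column names other than state_fips and county_fips, the toy values, and the helper is_valid_key() are all assumptions made up for this example, not part of any project code.

```r
# Hypothetical county-level dataset; the numbers are toy values for illustration.
counties <- data.frame(
  state_fips  = c(1, 1, 36),
  county_fips = c(1, 3, 109),
  sq_miles    = c(100, 200, 300)
)

# Build the key as described above: 1000 * state_fips + county_fips.
counties$fips <- 1000 * counties$state_fips + counties$county_fips

# Hypothetical helper: TRUE only if the key has no missing values and no duplicates.
is_valid_key <- function(key) {
  !anyNA(key) && anyDuplicated(key) == 0
}

# Stop the script immediately if the county key is not valid.
stopifnot(is_valid_key(counties$fips))

# Hypothetical state-level dataset, kept separate (normalized) with its own key.
states <- data.frame(
  state_fips = c(1, 36),
  state_name = c("Alabama", "New York")
)
stopifnot(is_valid_key(states$state_fips))

# Merge only when the analysis needs it. State-level variables are now
# repeated once per county, so this should be one of the last steps.
analysis <- merge(counties, states, by = "state_fips", all.x = TRUE)

# A many-to-one merge should not add or drop county rows.
stopifnot(nrow(analysis) == nrow(counties))
```

Putting checks like these right after each dataset is built means that a duplicated or missing key stops the run at the point where it is introduced, rather than surfacing later as an unexpected row count after a merge.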