diff --git a/README.md b/README.md
index 414cff7..7a9cb5b 100644
--- a/README.md
+++ b/README.md
@@ -1,16 +1,25 @@
 # Voting Abroad
+
+This project was prepared in collaboration with my (amazing) wife Vera (also a student).
+
 ## Data analysis of exit polls and official results of the 2024 Russian presidential election abroad
 This project is organized to be reproducible, following [reproducible research workflow from Aaron Gullickson](https://github.com/AaronGullickson/research-template).
-Some of the computaations in this project are computationally expensive. There are two ways in which this is tackled to aid reproducibility experience. The first one is caching, meaning that most models, printing of large text outputs and other expensive operations are cached on first render and then loaded from cache. Cache is not checked into the release. This means that when downloading a release version of the project and running `quarto render` or (in Rstidio) going `Build -> Render Project` for the first time, the project will take a (very-very-very) long time and may exceed your machine's in-memory storage (RAM) or processing capabilities.
+Some of the computations in this project are computationally expensive. The total runtime when rendering the project from scratch is around 2.5 hours on my machine (16 GB RAM, AMD Ryzen 7 5800H). This is tackled in two ways to aid reproducibility. The first is caching, enabled on a per-chunk basis, which allows most models, large printed outputs and other expensive operations to be cached on first render and then loaded from cache. The cache is not checked into version control or the releases. This means that when you download a release version of the project and run `quarto render` or (in RStudio) go `Build -> Render Project` for the first time, the project will take a long time to render and may exceed your machine's memory (RAM) or processing capabilities.
+
+To make the project friendly to those who can't reproduce the expensive computations, I write the results of models into a separate folder, `data/models_norender/`. A user who wishes to use these pre-computed versions instead of the preferred setup will need to manually edit the scripts to load these files in place of the computations. This typically involves manipulating the `eval: ` flags in chunk options.
+
+Quarto's `freeze` capability is not used, as it is incompatible with the reproducible setup here. The `remove_artifacts.ts` script removes the folder with the built data (`data_built/`) on project render; `freeze` can only be enabled at the project level, and it would prevent the built data from being written again.
-Raw data comes from data/data_raw folder. data/data_built is a folder with built datasets from data_raw. raw_data_prep.R and data_building.R are responsible for that - they live in scripts/data_building. data_built is fully wiped on project render only. Then there are quarto notebooks for different purposes. They are (in order of rendering):
+Raw data lives in the `data/data_raw` folder. `data/data_built` holds the datasets built from `data_raw` by `raw_data_prep.qmd` and `data_building.qmd`. Quarto notebooks take care of data building, processing, analysis and modelling. They are (in order of rendering):
-- individual_level/variables_descriptives.qmd, responsible for recoding exit poll questionnaire, broad data missingness diagnostics and some narrative as to what things mean
-- individual_level/imputation.qmd, needed to try and address the data missingness issues. Note that running it and not rendering is computationally intensive. There is also a (light) parallelization set to nCores() - 1
-- individual_level/models.qmd - here you can find individual-level (binary, multinomial and nested logit) models and mixed effects (nested logit) models. The latter include extensive convergence diagnostics. Models saved within this script go into scripts/models and are reproducible from code but not when building the project - too computationally intensive.
-- country_level/models_country.qmd, responsible for running aggregated-to-country-level linear models and exit poll selection logit. Those rewrite their outputs in scripts/models.
-- scripts/graphs/graphs.qmd - a bit of a hacky way to avoid having too much code in the main paper doc. Imports data from most previous scripts and modifies it so that only ggplot and kable need to be run in the paper.qmd.
-- presentation/presentation.qmd - a presentation, fully reproducible and untouched since presented before the project was finished.
-- paper/paper.qmd - the final paper, a compilation of all of the above.
+- `scripts/raw_data_prep.qmd` prepares the individual-level exit poll data, dubbed "raw" by the exit poll organization.
+- `scripts/data_building.qmd` cleans, combines and aggregates data from all sources and on all levels. The dataset generated here is used in most subsequent analyses.
+- `scripts/variables_descriptives.qmd` recodes the exit poll questionnaire, runs broad data-missingness diagnostics and provides some narrative on what the variables mean.
+- `scripts/imputation.qmd` tries to address the data-missingness issues. Note that running it without rendering is computationally intensive. There is also (light) parallelization, set to `nCores() - 1`.
+- `scripts/simple_models.qmd` contains the individual-level (binary, multinomial and nested logit) models.
+- `scripts/mixed_effects_models.qmd` presents the mixed-effects (nested logit) models, including extensive convergence diagnostics.
+- `scripts/models_country.qmd` runs the aggregated-to-country-level linear models and the exit poll selection logit. These rewrite their outputs in `scripts/models`.
+- `presentation/presentation.qmd` - a presentation, fully reproducible and untouched since it was presented, before the project was finished.
+- `paper/paper.qmd` - the final paper, a compilation of all of the above.
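+
+As a sketch of the `eval: ` flag manipulation described above (the chunk labels, model object and file name here are hypothetical, not the actual ones used in the scripts), switching from refitting a model to loading its pre-computed result typically looks like flipping a pair of chunk options:
+
+```r
+#| label: fit-mixed-model
+#| cache: true
+#| eval: true   # flip to false to skip the expensive fit
+mixed_model <- fit_mixed_model(exit_poll_data)  # hypothetical fitting call
+```
+
+```r
+#| label: load-mixed-model
+#| eval: false  # flip to true to load the pre-computed result instead
+mixed_model <- readRDS("data/models_norender/mixed_model.rds")
+```
+
+Only one of the two chunks should have `eval: true` at a time, so downstream chunks see a single `mixed_model` object either way.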