Workflow for using the Postal Code Conversion File Plus (PCCF+) with R. It is currently developed for use with PCCF+ 7E.
Statistics Canada and Canada Post are the owners of the copyright in the PCCF+. This repository does not provide or reconstruct the PCCF+, nor does it circumvent the need to obtain a copy and a license of the PCCF+. Use of code in this repository requires that you have already obtained access to the PCCF+ (e.g. via the Community Data Program).
This repository uses the renv package for isolation and reproducibility. Launching the R project (via PCCFplus-with-R.Rproj) will automatically download a compatible version of renv if not already present, and build the project-local library after prompting you to run the renv::restore() command. This is highly recommended, but not required. Alternatively, you can run the scripts using your main R library without renv, understanding that manual installation of missing packages will be required and that mismatched package versions could result in errors.
The two main scripts in this repository are described below:
text_to_csv.R converts the PCCF+'s ASCII data files (which you must provide in the data/txt directory) to a friendlier CSV format (written to the data/csv directory).
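As a rough illustration of this kind of conversion, the sketch below reads a fixed-width text file with base R's read.fwf() and writes it back out as CSV. The field widths, column names, and sample records are hypothetical placeholders and are not the real PCCF+ layout; the actual layouts come from the data dictionaries in the PCCF+ manual, and the generated text_to_csv.R may work differently.

```r
# Hypothetical layout for illustration only; NOT the real PCCF+ record layout.
widths    <- c(6, 8, 5)
col_names <- c("postal_code", "da_id", "weight")

# Made-up fixed-width records written to a temporary file so the sketch is
# self-contained (real input would live in data/txt).
txt_path <- tempfile(fileext = ".txt")
writeLines(c("A1A1A135060001 0.75",
             "A1A1A135060002 0.25"), txt_path)

# Read every field as character so IDs keep any leading zeros.
records <- read.fwf(txt_path, widths = widths, col.names = col_names,
                    colClasses = "character")

# Convert numeric fields explicitly after reading.
records$weight <- as.numeric(records$weight)

# Write the friendlier CSV (real output would go to data/csv).
csv_path <- tempfile(fileext = ".csv")
write.csv(records, csv_path, row.names = FALSE)
```

Reading everything as character first and converting selected columns afterwards is a defensive choice here, since identifiers stored as numbers can silently lose leading zeros.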
This code is programmatically generated by meta/meta.R, using the data dictionaries from the PCCF+ manual (meta/data_dictionary.R). Since the output of meta/meta.R is already provided (text_to_csv.R), code in the meta directory does not need to be executed by the typical user unless changes to code generation are desired.
To use regional_subset.R, you must have already populated the data/csv directory by executing text_to_csv.R.
Specify your region of interest with the DA_list variable near the top of the file as a vector of dissemination area IDs. For this DA_list, regional_subset.R will generate a weighted many-to-many relationship table that can be used for weighted assignment of 6-, 5-, 4-, or 3-character residential postal codes to dissemination areas. This relationship table is output to data/PCCF+ 7E regional subset.csv.
When matching postal codes to the relationship table, first attempt to match the full-length postal code. If there is no match, drop the last character of the postal code and try again. Repeat this until you have a match or only the first 3 characters of the postal code (the FSA) remain. If a postal code has no match even at the FSA level, it can be assumed to lie entirely outside the region.
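The matching procedure above can be sketched in base R as follows. The table contents, the column names (postal_code, da_id, weight), and the helper names match_postal_code() and assign_da() are all hypothetical and chosen for illustration; check the actual column names in the generated CSV before adapting this.

```r
# Hypothetical relationship table; column names and values are illustrative.
rel <- data.frame(
  postal_code = c("A1A1A1", "A1A1A1", "A1A1", "B2B"),
  da_id       = c("35060001", "35060002", "35060003", "35000000"),
  weight      = c(0.6, 0.4, 1, 1),
  stringsAsFactors = FALSE
)

# Return the rows matching the longest available prefix of `pc`
# (6, then 5, then 4, then 3 characters), or zero rows if nothing matches.
match_postal_code <- function(pc, rel) {
  pc <- toupper(gsub("[^A-Za-z0-9]", "", pc))  # "a1a 1a1" -> "A1A1A1"
  if (nchar(pc) < 3) return(rel[0, ])
  for (len in seq(min(nchar(pc), 6), 3)) {
    hit <- rel[rel$postal_code == substr(pc, 1, len), , drop = FALSE]
    if (nrow(hit) > 0) return(hit)
  }
  rel[0, ]  # no match even at the FSA level: treat as out of region
}

# Draw one DA for a postal code with probability proportional to weight.
# Returns NA when the postal code has no match.
assign_da <- function(pc, rel) {
  hit <- match_postal_code(pc, rel)
  if (nrow(hit) == 0) return(NA_character_)
  hit$da_id[sample.int(nrow(hit), 1, prob = hit$weight)]
}
```

For example, match_postal_code("A1A 1B2", rel) finds nothing at 6 or 5 characters and falls back to the 4-character entry, while a postal code whose FSA is absent from the table returns zero rows. Random assignment via assign_da() is only one way to use the weights; apportioning counts proportionally across DAs is another.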
In the relationship table, any FSA containing at least one postal code with in-region weight is included in its entirety, with one simplification: the weight for all out-of-region dissemination areas is pooled and assigned a DA ID of 35000000. The purpose of including this explicit out-of-region area is to prevent erroneously shortening a postal code during matching when there is an out-of-region match at a longer length.
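The pooling step can be pictured with a small base R sketch. This assumes pooling happens separately for each postal code entry, and the DA_list values, column names, and weights are made up for illustration; regional_subset.R may implement this differently.

```r
# The pooled out-of-region DA ID described above.
OUT_OF_REGION <- "35000000"

# Hypothetical region of interest and unpooled weights for one postal code.
DA_list <- c("35060001", "35060002")
full <- data.frame(
  postal_code = rep("A1A1A1", 4),
  da_id       = c("35060001", "35060002", "35070001", "35070002"),
  weight      = c(0.5, 0.2, 0.2, 0.1),
  stringsAsFactors = FALSE
)

# Relabel every DA outside the region, then sum weights per postal code and DA.
full$da_id[!(full$da_id %in% DA_list)] <- OUT_OF_REGION
pooled <- aggregate(weight ~ postal_code + da_id, data = full, FUN = sum)
```

In this made-up example, the two out-of-region rows collapse into a single row for 35000000 whose weight is the sum of the two, while the in-region rows are unchanged. Keeping that row means a postal code that lies partly or wholly outside the region still matches at its full length instead of falling through to a shorter prefix.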
Note that when using the output of this script you will not get warnings or quality indicators like you would when using the PCCF+ product. Use at your own risk.
Contributions and improvements are welcome! This can range from minor (e.g. project dependency updates, documentation, code comments) to more substantial bugfixes or feature additions (e.g. increased generality, support for institutional postal codes, warnings, or quality indicators).