Draft specification for labels #45
base: scoping-doc
The document is pretty good but missing more text about what's in scope for the library.
I wrote a new version of the scope document. It hopefully addresses the comments, but unfortunately it requires a new review. The main body is a higher-level scope. An extensive appendix serves as a reference and explanation. That appendix was developed in discussion with an LLM. References are included.
This new version is way better but I would include more code examples exploring some of your statements. In addition, a lot of what you said in the appendix is actually really useful and should be brought back into the main document. Finally, there are some formatting issues that I can fix.
If you want to address my comments, I can then refactor the document to:
- Fix the formatting issues
- Bring the appendix text into the main document
- Write supporting code examples
This document outlines the scope and key technical requirements for improving how **`recodeflow`** handles variable and value labels (metadata). The primary objectives are:

- **Enable reproducible, standardized metadata**: Use proven conventions from `haven` and `labelled` so that data analysts can apply labels consistently in R.
- **Facilitate collaboration**: Store metadata in human-readable CSV sidecars (`variables` and `variable_details`) for project teams, including those who are not R users.
Can you expand on why you see the CSV files as a sidecar, or what you mean by sidecar? The CSV files are central to the entire flow, so it doesn't make sense to me to call them a sidecar.
- **Support data transformations**: Preserve labels as data is pivoted, joined, or subset, while respecting the changing context of variables.
- **Plan for future needs**: Align with existing R ecosystem packages and keep open the possibility of supporting DDI, LinkML, or extended role-based metadata in subsequent releases.
Can you explain "role-based metadata" within the context of labels?
## 1. Integration with the R ecosystem

The label system in `recodeflow` will rely on **existing standards** from packages like `haven` and `labelled`:

- Use `attr(x, "label")` to store the variable label (e.g., `"Age of respondent"`).
- Use `attr(x, "labels")` (with class `haven_labelled`) to store **value labels** (e.g., `0 = "Censored"`, `1 = "Transplant"`).
- Avoid duplicating existing functions for applying labels. Instead, harness utilities such as `labelled::var_label()` and `labelled::val_labels()`, and provide **wrapper functions** only if they add new `recodeflow`-specific features (e.g., syncing sidecar metadata).
Do you have references showing that `label` and `labels` are the attributes to use for variable and category labels?
For the second point, I'm not sure what you mean by "with class `haven_labelled`". Isn't the `attr` function from base R?
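A minimal sketch of the distinction raised in this comment, assuming the `haven` and `labelled` packages are installed. `attr()` is base R and only attaches attributes, whereas `haven::labelled()` sets the same `label` and `labels` attributes and also adds the `haven_labelled` class. The variable names and values are illustrative.

```r
# Base R only: attach the attributes by hand; no extra class is added.
status_base <- c(0, 1, 1, 0)
attr(status_base, "label") <- "Patient status"
attr(status_base, "labels") <- c(Censored = 0, Transplant = 1)
inherits(status_base, "haven_labelled")  # FALSE

# haven::labelled() stores the same two attributes and adds the class.
status_hv <- haven::labelled(
  c(0, 1, 1, 0),
  labels = c(Censored = 0, Transplant = 1),
  label = "Patient status"
)
inherits(status_hv, "haven_labelled")

# The labelled accessors read those attributes from either object.
labelled::var_label(status_base)
labelled::var_label(status_hv)
labelled::val_labels(status_hv)
```

The class matters mainly for printing and for conversions such as `haven::as_factor()`; the attributes themselves can be read with base R in both cases.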
To illustrate label handling, consider a minimal example:

``` r
Do you want this chunk to execute? Right now it doesn't, because the `r` should be `{r}`.
# labelled package approach
library(labelled)
data <- set_variable_labels(data, status = "Patient status")
```
I recommend not importing the library but using the `::` operator. This way it's easy to know where the function comes from. It's obvious here, but that's the standard we probably want in the library.
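A sketch of the same call written with the `::` operator, assuming the `labelled` package is installed. The data frame is illustrative; the snippet under review assumes `data` already exists.

```r
# Illustrative data frame standing in for the undefined `data` object.
data <- data.frame(status = c(0, 1, 1, 0), age = c(54, 61, 47, 58))

# Namespace-qualified calls: no library() call is needed,
# and the origin of each function is visible at the call site.
data <- labelled::set_variable_labels(data, status = "Patient status")
labelled::var_label(data$status)
```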
- **Memory efficiency:** Storing labels as attributes (which are essentially small vectors or strings attached to each column) is generally lightweight. A variable label is just a single string, and value labels are a vector of label mappings which, in most cases, is much smaller than the data itself (since it has only one entry per unique code). We will ensure that these attributes are only as large as necessary and avoid duplicating data. For example, if a variable has 100 possible values but only 5 are actually labeled, we will store only those 5 label pairs, not all 100 values. This is consistent with how `haven_labelled` works (labels need not be exhaustive). For extremely large taxonomies or code lists, we will document best practices (such as using factors or reference tables) to avoid enormous in-memory label vectors.

- **Avoiding data copies:** In R, adding or modifying attributes to a vector can sometimes trigger a copy of that vector in memory. To mitigate this, `recodeflow` will try to set labels at the time of variable creation or recoding (so that attributes are added before the variable is used widely, reducing duplication in memory). Bulk operations will be used where possible, for instance applying variable labels to all data frame columns in a single function call rather than iterating over columns in R, to leverage vectorized internal operations. The `set_data_labels()` function (or its improved equivalent) will label an entire dataset in one go, which is more efficient than labeling column-by-column in a loop.
Reference for the first sentence?
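One way to check the memory and copy claims empirically is sketched below using base R only. It compares the size of a partial value-label vector with the size of the data, and uses `tracemem()` (available where R was built with memory profiling) to report when a vector is duplicated. The codes and label names are illustrative assumptions.

```r
set.seed(1)
x <- sample(1:100, 1e6, replace = TRUE)  # one million integer codes

# Label only the 5 codes that need a label, not all 100.
value_labels <- c(Never = 1L, Rarely = 2L, Sometimes = 3L, Often = 4L, Always = 5L)

data_size  <- object.size(x)
label_size <- object.size(value_labels)
print(data_size, units = "Kb")
print(label_size, units = "Kb")

# Copy-on-modify: `y` shares memory with `x` until it is modified.
y <- x
if (capabilities("profmem")) invisible(tracemem(y))  # reports duplications of this vector
attr(y, "labels") <- value_labels        # modifying a shared vector forces a copy
attr(y, "label")  <- "Frequency of use"  # `y` is no longer shared with `x`
if (capabilities("profmem")) untracemem(y)

# `x` is untouched by the changes made to `y`.
is.null(attr(x, "labels"))
```

Whether the second attribute assignment copies again depends on the R version and on how many references the vector has, so the `tracemem()` output is worth inspecting rather than assuming.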
- **Efficient lookup and manipulation:** Functions to get or set labels will be optimized to avoid unnecessary data scanning. Retrieving a label should be near O(1) (constant time) since it can directly fetch the attribute or look up in a keyed sidecar table. Setting a label will typically be O(1) for a variable label (assign an attribute) or O(m) for m labels in a value labels vector (which is usually small). We will also consider using `data.table` or keyed tibbles for the sidecar metadata to allow fast joins or lookups by variable name when applying labels to data, ensuring that even if there are thousands of variables, the labeling operation remains fast.
References or example code to support these performance claims?
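A base R sketch of the lookup pattern the bullet describes, not a benchmark. The `variables` table, its column names, and the data are hypothetical stand-ins for the sidecar sheet. The lookup is built once as a hashed environment, and each label is then fetched by name.

```r
# Hypothetical sidecar table, standing in for the `variables` sheet.
variables <- data.frame(
  variable = c("age", "status", "bmi"),
  label = c("Age of respondent", "Patient status", "Body Mass Index"),
  stringsAsFactors = FALSE
)

# Build a hashed lookup once; hash = TRUE is set explicitly because
# list2env() only hashes by default for more than 100 entries.
label_env <- list2env(
  as.list(stats::setNames(variables$label, variables$variable)),
  hash = TRUE
)

data <- data.frame(age = c(54, 61), status = c(0, 1), site = c("A", "B"))

# Apply labels to the columns that appear in the sidecar.
for (nm in intersect(names(data), variables$variable)) {
  attr(data[[nm]], "label") <- get(nm, envir = label_env, inherits = FALSE)
}

# Retrieving a label reads one attribute; it does not scan the data.
attr(data$age, "label")
```

Timing this with `system.time()` on a realistic number of variables would be needed before stating complexity guarantees in the specification.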
- **No interference with tagged NAs:** Operations on labels will not remove or corrupt tagged NA values. For instance, if a variable contains `NA(b)` to indicate some missing category, setting or getting labels for that variable should not turn `NA(b)` into a regular NA or a numeric code. The underlying data type remains the same. Our implementation will recognize `haven`'s tagged NA class and handle it accordingly. If we summarize or print labeled data, we will rely on `haven`/`labelled` print methods that already know how to display tagged NAs (showing `NA(b)` etc.). If any custom processing is done, the NA status will be preserved. In essence, tagged NAs will be treated as a valid "value" in the value labels vector (so they can have a label), but we won't inadvertently treat them as ordinary values.

- **Clear specification for missing in metadata:** The specification will clarify how to denote missing values in `variables.csv` or `variable_details.csv`. For example, we may decide that within the metadata, a certain column (or perhaps a special notation in the labels) indicates missing. Some systems use an annotation like “” next to the label. We might introduce a convention such as: if a value label is enclosed in brackets or flagged in an additional column (e.g., a boolean `missing` column), `recodeflow` will know to handle that code as missing. This detail will be ironed out in the tagged NA feature, but the label functionality will be aware of it. The result should be that if a user marks a code as missing in the metadata, after applying labels, the R data's values for that code are converted to NA (with tag if needed) and carry the appropriate label. For SPSS-like user missing values, we will consider using the class `haven_labelled_spss`, which allows storing a separate `na_values` attribute for truly round-trip compatibility, but even if not, the combination of tagged NA and label will serve the purpose.
The sheets already have a way of representing tagged NA values by using NA::a, NA::b. Were you thinking of changing that?
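A sketch of how the existing `NA::a` / `NA::b` sheet notation could map onto `haven` tagged NAs while labels are applied, assuming `haven` and `labelled` are installed. The helper `sheet_code_to_tagged_na()` is hypothetical and only illustrates the mapping; the values and labels are made up.

```r
# Hypothetical helper: turn a sheet code such as "NA::a" into a haven tagged NA.
sheet_code_to_tagged_na <- function(code) {
  haven::tagged_na(sub("^NA::", "", code))
}

na_a <- sheet_code_to_tagged_na("NA::a")
na_b <- sheet_code_to_tagged_na("NA::b")

# Tagged NAs can appear both in the data and in the value labels.
x <- haven::labelled(
  c(1, 2, na_a, na_b),
  labels = c(Yes = 1, No = 2, "Not applicable" = na_a, "Missing" = na_b)
)

# Setting a variable label changes only an attribute, not the values.
labelled::var_label(x) <- "Ever smoked"

# The tags are still present after labelling.
haven::na_tag(unclass(x))
```

Under this approach the sheet notation would stay as it is, and only the in-memory representation would use tagged NAs.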
While the primary focus is on labels, the design will allow extensibility to other metadata fields such as short vs. long labels, descriptions, and notes. We will incorporate these to complement the existing `variables` and `variable_details` structure, rather than overloading the R objects with extra attributes.

- **Short and long variable labels:** Many datasets distinguish between a short label (a brief name) and a longer description. For example, a variable BMI might have a short label "Body Mass Index" and a long description "BMI, calculated as weight (kg) / \[height (m)\]\^2". In `recodeflow`, we plan to support storing both. The variables sheet can have separate columns, e.g., `label_short` and `label_long` (or `description`). The short label would be what we treat as the "label" attribute in R (concise enough to show in a data viewer), while the long label/description would be stored only in the sidecar (and used for documentation generation like codebooks or when someone needs more context). This aligns with practices in codebook standards and IPUMS-like metadata, where variable labels and descriptions are distinct fields. We will ensure the functions allow access to both (e.g., `get_variable_description(name)` might retrieve the long description). However, to keep the R object lightweight, we will not attach the full long description as an attribute to every vector (which could be quite large text). Instead, the long descriptions live in the `variable_details` or `variables` tables and can be queried on demand.
Examples of codebook standards and IPUMS metadata?
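The short/long split could look roughly like the base R sketch below. The column names `label_short` and `label_long` come from the proposal; the table contents, the second argument of `get_variable_description()`, and the data are illustrative assumptions rather than an agreed design.

```r
# Hypothetical layout for the `variables` sheet with short and long labels.
variables <- data.frame(
  variable = c("bmi", "age"),
  label_short = c("Body Mass Index", "Age"),
  label_long = c(
    "BMI, calculated as weight (kg) / [height (m)]^2",
    "Age of respondent in years (illustrative description)"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical accessor: look up the long description on demand.
get_variable_description <- function(name, variables) {
  variables$label_long[match(name, variables$variable)]
}

data <- data.frame(bmi = c(22.4, 27.1), age = c(54, 61))

# Only the short label is attached to each column as an attribute.
for (nm in intersect(names(data), variables$variable)) {
  attr(data[[nm]], "label") <- variables$label_short[match(nm, variables$variable)]
}

attr(data$bmi, "label")
get_variable_description("bmi", variables)
```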
- **Documentation of expected behavior:** The package vignettes and function references will describe how labels are preserved or modified during common operations. For example, if a user uses `dplyr::mutate()` to create a new variable from labeled variables, what happens? We will specify that by default, new variables will have no label (unless the user explicitly sets one), and existing labels will carry over to unmodified variables. If an existing labeled variable is transformed (e.g., scaled or recoded), the label may be dropped because the content has changed, unless our function `rec_with_table()` handles it internally. We will integrate with known behaviors; for instance, when using `haven::as_factor()` on a labelled variable, the result is a factor with levels corresponding to the value labels, but the variable-level label is usually lost in that conversion. Users should be aware of this. In cases where R's native or tidyverse behavior might strip labels, we will mention it and often provide a workaround or our own function. A section of the vignette could be "Working with labeled data: do's and don'ts", covering things like subsetting (which retains labels on the subset), combining data frames, pivoting, etc.

- **Preservation of metadata in transformations:** We will aim for labels to survive through data transformations as much as logically possible. For instance:
Can you provide code examples of how you imagine this would work for each transformation type? Would we override dplyr's `mutate` method to preserve labels? Would we have a separate function that does this?
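One of the two options raised in this comment, a separate restore function rather than an override of `mutate()`, could be sketched in base R as follows. The helper `restore_var_labels()` is hypothetical and is not part of `recodeflow` or `labelled`; it copies the `label` attribute onto same-named columns after a transformation that stripped it.

```r
# Hypothetical helper: copy variable labels from `from` to same-named columns in `to`.
restore_var_labels <- function(to, from) {
  for (nm in intersect(names(to), names(from))) {
    lbl <- attr(from[[nm]], "label", exact = TRUE)
    if (!is.null(lbl)) attr(to[[nm]], "label") <- lbl
  }
  to
}

data <- data.frame(id = 1:4, age = c(54, 61, 47, 58))
attr(data$age, "label") <- "Age of respondent"

# as.numeric() strips attributes, so the label is lost in this step.
transformed <- data
transformed$age <- as.numeric(transformed$age)
is.null(attr(transformed$age, "label"))

# Explicit restore step after the transformation.
transformed <- restore_var_labels(transformed, data)
attr(transformed$age, "label")
```

The `labelled` package offers `copy_labels()` for a similar purpose. An explicit step like this keeps the behavior visible to the user, at the cost of one extra call per pipeline.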
The PR has specifications for refactoring labels to address the approach of moving labels out of rec_with_table(). Further, we have little use of labels and metadata, but this is a priority feature.
There is a clearer and more consistent approach to variable labels. I propose we extend the `labelled` library, and there are a few ways to do so.