Draft specification for labels #45

DougManuel · 2024-11-18T19:45:16Z

This PR contains a specification for refactoring labels, so that label handling moves out of rec_with_table(). We currently make little use of labels and metadata, but this is a priority feature.

The specification proposes a clearer and more consistent approach to variable labels. I propose that we extend the 'labelled' library; there are a few ways to do so.

yulric

The document is pretty good, but it needs more text about what's in scope for the library.

docs/labels.qmd

DougManuel · 2025-02-18T11:16:45Z

I wrote a new version of the scope document. It hopefully addresses the comments, but unfortunately, it requires a new review.

The main body is a higher-level scope. An extensive appendix serves as a reference and explanation. That appendix was developed in discussion with an LLM. References are included.

yulric

This new version is way better but I would include more code examples exploring some of your statements. In addition, a lot of what you said in the appendix is actually really useful and should be brought back into the main document. Finally, there are some formatting issues that I can fix.

If you want to address my comments, I can then refactor the document to:

- Fix the formatting issues
- Bring the appendix text into the main document
- Write supporting code examples

yulric · 2025-02-22T04:02:01Z

docs/labels.qmd

+This document outlines the scope and key technical requirements for improving how **`recodeflow`** handles variable and value labels (metadata). The primary objectives are:
+
+-   **Enable reproducible, standardized metadata**: Use proven conventions from `` `haven` `` and `` `labelled` `` so that data analysts can apply labels consistently in R.
+-   **Facilitate collaboration**: Store metadata in human-readable CSV sidecars (`variables` and `variable_details`) for project teams, including those who are not R users.


Can you expand on why you see the CSV files as a sidecar, or on what you mean by sidecar? The CSV files are central to the entire flow, so it doesn't make sense to me to call them a sidecar.

yulric · 2025-02-22T04:04:11Z

docs/labels.qmd

+-   **Enable reproducible, standardized metadata**: Use proven conventions from `` `haven` `` and `` `labelled` `` so that data analysts can apply labels consistently in R.
+-   **Facilitate collaboration**: Store metadata in human-readable CSV sidecars (`variables` and `variable_details`) for project teams, including those who are not R users.
+-   **Support data transformations**: Preserve labels as data is pivoted, joined, or subsetted, while respecting the changing context of variables.
+-   **Plan for future needs**: Align with existing R ecosystem packages and keep open the possibility of supporting DDI, LinkML, or extended role-based metadata in subsequent releases.


Can you explain "role-based metadata" within the context of labels?

yulric · 2025-02-22T04:22:58Z

docs/labels.qmd

+
+## 1. Integration with the R ecosystem
+
+The label system in `recodeflow` will rely on **existing standards** from packages like `` `haven` `` and `` `labelled` ``: - Use `attr(x, "label")` to store the variable label (e.g., `"Age of respondent"`). - Use `attr(x, "labels")` (with class `` `haven`_`labelled` ``) to store **value labels** (e.g., `0 = "Censored"`, `1 = "Transplant"`). - Avoid duplicating existing functions for applying labels. Instead, harness utilities such as `` `labelled`::var_label() `` and `` `labelled`::val_labels() ``, and provide **wrapper functions** only if they add new `recodeflow`-specific features (e.g., syncing sidecar metadata).


Do you have references showing that label and labels are the attributes to use for variable and category labels?

For the second point, I'm not sure what you mean by "with class haven_labelled". Isn't the attr function from base R?
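A minimal sketch of the distinction raised above, using the example values from the quoted hunk. `attr()` is indeed base R and can set both attributes on a plain vector; the `haven_labelled` class is something `haven::labelled()` adds on top of the same two attributes. The `haven` part only runs if the package is installed.

```r
# Base R only: attributes on a plain numeric vector.
x <- c(0, 1, 1, 0)
attr(x, "label") <- "Patient status"                    # variable label
attr(x, "labels") <- c(Censored = 0, Transplant = 1)    # value labels

# With haven: the same attributes, plus the haven_labelled class.
if (requireNamespace("haven", quietly = TRUE)) {
  y <- haven::labelled(
    c(0, 1, 1, 0),
    labels = c(Censored = 0, Transplant = 1),
    label = "Patient status"
  )
  print(class(y))
  print(attr(y, "label"))
  print(attr(y, "labels"))
}
```

The class matters because methods such as printing and subsetting dispatch on it; a plain vector carrying only the attributes does not get those methods.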

yulric · 2025-02-22T04:25:48Z

docs/labels.qmd

+
+To illustrate label handling, consider a minimal example:
+
+``` r


Do you want this chunk to execute? Right now it doesn't, because the r should be {r}.

yulric · 2025-02-22T04:29:32Z

docs/labels.qmd

+
+# `labelled` package approach
+library(`labelled`)
+data <- set_variable_labels(data, status = "Patient status")


I recommend not importing the library and using the :: operator instead. That way it's easy to know where each function comes from. It's obvious here, but that's the standard we probably want in the library.
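For illustration, the quoted line could be written with `::` as in the sketch below. The small `data` frame is made up for the example, and the block only runs if `labelled` is installed.

```r
# Illustrative data; only the column name comes from the quoted hunk.
data <- data.frame(status = c(0, 1, 1))

if (requireNamespace("labelled", quietly = TRUE)) {
  # No library() call: the namespace prefix shows where each function lives.
  data <- labelled::set_variable_labels(data, status = "Patient status")
  print(labelled::var_label(data$status))
}
```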

yulric · 2025-02-22T04:52:03Z

docs/labels.qmd

+
+-   **Memory efficiency:** Storing labels as attributes (which are essentially small vectors or strings attached to each column) is generally lightweight. A variable label is just a single string, and value labels are a vector of label mappings which, in most cases, is much smaller than the data itself (since it only has one entry per unique code). We will ensure that these attributes are only as large as necessary and avoid duplicating data. For example, if a variable has 100 possible values but only 5 are actually labeled, we will only store those 5 label pairs, not all 100 values. This is consistent with how `haven`\_`labelled` works (labels need not be exhaustive) . For extremely large taxonomies or code lists, we will document best practices (such as using factors or reference tables) to avoid enormous in-memory label vectors.
+
+-   **Avoiding data copies:** In R, adding or modifying attributes to a vector can sometimes trigger a copy of that vector in memory. To mitigate this, `recodeflow` will try to set labels at the time of variable creation or recoding (so that attributes are added before the variable is used widely, reducing duplication in memory). Bulk operations will be used where possible – for instance, applying variable labels to all data frame columns in a single function call, rather than iterating over columns in R, to leverage vectorized internal operations. The set_data_labels() function (or its improved equivalent) will label an entire dataset in one go, which is more efficient than labeling column-by-column in a loop .


Reference for the first sentence?
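In the absence of a reference, the memory claim in the quoted hunk could be checked empirically. The sketch below uses base R's `object.size()` to compare a vector before and after attaching a variable label and two value labels; the vector length is arbitrary. Copy behaviour could be checked separately with base R's `tracemem()`, where the R build supports it.

```r
# A numeric vector large enough that attribute overhead is easy to compare.
n <- 1e6
x <- rep(c(0, 1, 2), length.out = n)
base_bytes <- as.numeric(object.size(x))

# Attach a variable label and value labels for two of the three codes
# (labels need not be exhaustive).
attr(x, "label") <- "Patient status"
attr(x, "labels") <- c(Censored = 0, Transplant = 1)
labelled_bytes <- as.numeric(object.size(x))

added_bytes <- labelled_bytes - base_bytes
cat("Bytes added by labels:", added_bytes, "\n")
cat("Relative overhead:", added_bytes / base_bytes, "\n")
```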

yulric · 2025-02-22T04:54:16Z

docs/labels.qmd

+
+-   **Avoiding data copies:** In R, adding or modifying attributes to a vector can sometimes trigger a copy of that vector in memory. To mitigate this, `recodeflow` will try to set labels at the time of variable creation or recoding (so that attributes are added before the variable is used widely, reducing duplication in memory). Bulk operations will be used where possible – for instance, applying variable labels to all data frame columns in a single function call, rather than iterating over columns in R, to leverage vectorized internal operations. The set_data_labels() function (or its improved equivalent) will label an entire dataset in one go, which is more efficient than labeling column-by-column in a loop .
+
+-   **Efficient lookup and manipulation:** Functions to get or set labels will be optimized to avoid unnecessary data scanning. Retrieving a label should be near O(1) (constant time) since it can directly fetch the attribute or look up in a keyed sidecar table. Setting a label will typically be O(1) for a variable label (assign an attribute) or O(m) for m labels in a value labels vector (which is usually small). We will also consider using `data.table` or keyed `tibbles` for the sidecar metadata to allow fast joins or lookups by variable name when applying labels to data, ensuring that even if there are thousands of variables, the labeling operation remains fast.


Do you have references or example code to support these performance claims?
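One way to test the claims is a small timing harness like the sketch below. It assumes the sidecar is a data frame with `variable` and `label` columns, and `apply_var_labels()` is a hypothetical helper, not an existing `recodeflow` function. Timings depend on the machine, so none are asserted here.

```r
# Assumed sidecar layout: one row per variable.
variables <- data.frame(
  variable = c("age", "status"),
  label = c("Age of respondent", "Patient status"),
  stringsAsFactors = FALSE
)
data <- data.frame(age = c(25, 40, 61), status = c(0, 1, 1))

# Named vector keyed by variable name for lookup.
label_lookup <- setNames(variables$label, variables$variable)

# Hypothetical helper: apply every matching variable label in one call.
apply_var_labels <- function(data, lookup) {
  for (nm in intersect(names(data), names(lookup))) {
    attr(data[[nm]], "label") <- lookup[[nm]]
  }
  data
}
data <- apply_var_labels(data, label_lookup)

# Retrieval cost can be compared across vector sizes.
small <- data$age
big <- rep(data$age, length.out = 1e6)
attr(big, "label") <- attr(small, "label")
t_small <- system.time(for (i in seq_len(1e4)) attr(small, "label", exact = TRUE))
t_big <- system.time(for (i in seq_len(1e4)) attr(big, "label", exact = TRUE))
print(rbind(small = t_small, big = t_big))
```

If retrieval is constant time, the two rows should be of similar magnitude regardless of vector length.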

yulric · 2025-02-22T17:21:49Z

docs/labels.qmd

+
+-   **No Interference with tagged NAs**: Operations on labels will not remove or corrupt tagged NA values. For instance, if a variable contains NA(b) to indicate some missing category, setting or getting labels for that variable should not turn NA(b) into a regular NA or a numeric code. The underlying data type remains the same. Our implementation will recognize `haven`’s tagged NA class and handle it accordingly. If we summarize or print labeled data, we will rely on haven/labelled print methods that already know how to display tagged NAs (showing NA(b) etc.) . If any custom processing is done, the NA status will be preserved. In essence, tagged NAs will be treated as a valid “value” in the value labels vector (so they can have a label), but we won’t inadvertently treat them as ordinary values.
+
+-   **Clear specification for missing in metadata:** The specification will clarify how to denote missing values in `variables.csv` or `variable_details.csv`. For example, we may decide that within the metadata, a certain column (or perhaps a special notation in the labels) indicates missing. Some systems use an annotation like “” next to the label. We might introduce a convention such as: if a value label is enclosed in brackets or flagged in an additional column (e.g., a boolean missing column), `recodeflow` will know to handle that code as missing. This detail will be ironed out in the tagged NA feature, but the label functionality will be aware of it. The result should be that if a user marks a code as missing in the metadata, after applying labels, the R data’s values for that code are converted to NA (with tag if needed) and carry the appropriate label. For SPSS-like user missing values, we will consider using the class `haven`\_`labelled`\_spss which allows storing a separate na_values attribute for truly round-trip compatibility , but even if not, the combination of tagged NA and label will serve the purpose.


The sheets already have a way of representing tagged NA values by using NA::a, NA::b. Were you thinking of changing that?
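As a sketch of how the existing sheet notation could map onto `haven`'s tagged NAs without any change to the sheets: `parse_sheet_na()` below is a hypothetical helper that turns a code such as `NA::a` into `haven::tagged_na("a")`. The block only runs if `haven` is installed.

```r
if (requireNamespace("haven", quietly = TRUE)) {
  # Hypothetical helper: "NA::a" -> tagged NA with tag "a".
  parse_sheet_na <- function(code) {
    haven::tagged_na(sub("^NA::", "", code))
  }

  x <- c(1, 2, parse_sheet_na("NA::a"), parse_sheet_na("NA::b"))
  tags_before <- haven::na_tag(x)

  # Setting a variable label should leave the tags untouched.
  attr(x, "label") <- "Example variable"
  tags_after <- haven::na_tag(x)

  print(tags_before)
  print(tags_after)
  print(is.na(x))
}
```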

yulric · 2025-02-22T17:26:18Z

docs/labels.qmd

+
+While the primary focus is on labels, the design will allow extensibility to other metadata fields such as short vs. long labels, descriptions, and notes. We will incorporate these to complement the existing variables and variable_details structure, rather than overloading the R objects with extra attributes.
+
+-   **Short and long variable labels:** Many datasets distinguish between a short label (a brief name) and a longer description. For example, a variable BMI might have a short label “Body Mass Index” and a long description “BMI, calculated as weight (kg) / \[height (m)\]\^2”. In `recodeflow`, we plan to support storing both. The variables sheet can have separate columns, e.g., label_short and label_long (or description). The short label would be what we treat as the “label” attribute in R (concise enough to show in a data viewer), while the long label/description would be stored only in the sidecar (and used for documentation generation like codebooks or when someone needs more context). This aligns with practices in codebook standards and IPUMS-like metadata , where variable labels and descriptions are distinct fields. We will ensure the functions allow access to both (e.g., **get_variable_description(name)** might retrieve the long description). However, to keep the R object lightweight, we will not attach the full long description as an attribute to every vector (which could be quite large text). Instead, the long descriptions live in `variable_details` or variables tables and can be queried on demand.


Examples of codebook standards and IPUMS metadata?
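The short/long split described in the quoted hunk could look like the sketch below. The column names `label_short` and `label_long` come from the hunk; the second row and the extra `variables` argument to `get_variable_description()` are assumptions for illustration.

```r
# Assumed layout of the variables sheet with both label columns.
variables <- data.frame(
  variable = c("BMI", "age"),
  label_short = c("Body Mass Index", "Age"),
  label_long = c(
    "BMI, calculated as weight (kg) / [height (m)]^2",
    "Age of respondent"
  ),
  stringsAsFactors = FALSE
)

# Sketch of the accessor: the long description stays in the sheet
# and is looked up on demand.
get_variable_description <- function(name, variables) {
  variables$label_long[match(name, variables$variable)]
}

# Only the short label is attached to the vector itself.
bmi <- c(22.5, 27.1, 31.4)
attr(bmi, "label") <- variables$label_short[match("BMI", variables$variable)]

print(attr(bmi, "label"))
print(get_variable_description("BMI", variables))
```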

yulric · 2025-02-22T17:34:09Z

docs/labels.qmd

+
+-   **Documentation of expected behavior:** The package vignettes and function references will describe how labels are preserved or modified during common operations. For example, if a user uses `dplyr::mutate()` to create a new variable from labeled variables, what happens? We will specify that by default, new variables will have no label (unless the user explicitly sets one), and existing labels will carry over to unmodified variables. If an existing labeled variable is transformed (e.g., scaled or recoded), the label may be dropped because the content has changed – unless our function `recode_with_table` handles it internally. We will integrate with known behaviors; for instance, when using haven::as_factor() on a `labelled` variable, the result is a factor with levels corresponding to the value labels, but the variable-level label is usually lost in that conversion. Users should be aware of this. In cases where R’s native or tidyverse behavior might strip labels, we will mention it and often provide a workaround or our own function. A section of the vignette could be “Working with labeled data: do’s and don’ts”, covering things like subsetting (which retains labels on the subset), combining data frames, pivoting, etc.
+
+-   **Preservation of metadata in transformations:** We will aim for labels to survive through data transformations as much as logically possible. For instance:


Can you provide code examples of how you imagine this would work for each transformation type? Would we override dplyr's mutate method to preserve labels, or would we have a separate function that does this?
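As a starting point for one transformation type (row subsetting), the sketch below shows the separate-function option rather than overriding `mutate()`. It uses base R only. `restore_var_labels()` is a hypothetical helper, and the behaviour shown applies to plain vectors carrying a `label` attribute; vectors of class `haven_labelled` have their own subsetting method.

```r
data <- data.frame(age = c(25, 40, 61), status = c(0, 1, 1))
attr(data$age, "label") <- "Age of respondent"
attr(data$status, "label") <- "Patient status"

# Base R subsetting drops most attributes on plain vectors,
# so the variable labels are lost here.
older <- data[data$age > 30, ]
label_after_subset <- attr(older$age, "label")

# Hypothetical helper: copy variable labels from the source data frame
# to the transformed one, matching columns by name.
restore_var_labels <- function(to, from) {
  for (nm in intersect(names(to), names(from))) {
    lbl <- attr(from[[nm]], "label", exact = TRUE)
    if (!is.null(lbl)) {
      attr(to[[nm]], "label") <- lbl
    }
  }
  to
}

older <- restore_var_labels(older, data)
print(label_after_subset)
print(attr(older$age, "label"))
```

A helper like this leaves `dplyr` verbs untouched and makes the restore step explicit; whether that or a method override is preferable is a design decision for the specification.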

Draft specification for labels

96cf40d

DougManuel requested a review from yulric November 18, 2024 19:45

DougManuel added the enhancement label Nov 18, 2024

yulric marked this pull request as ready for review November 19, 2024 14:56

Add label helper for labels

cbfe894

yulric requested changes Nov 21, 2024


DougManuel added 3 commits November 25, 2024 21:36

Delete pbc_variable_details.xlsx added by mistake. Update renv

ef30d55

Update introduction with an introduction of worksheet. PBC example.

8cc9bae

Label scope - draft #2

f059076

DougManuel requested a review from yulric February 18, 2025 11:16

remove renv files from the index

fc3a593

yulric requested changes Feb 22, 2025



