different: Features and API design #7
What about remote tables? Like using |
Current State
There are two main features of different that have been implemented to date: diff_compare(x, y) generates a diff_tbl object that contains a tidy summary of the differences between x and y, and diff_report(x, y) creates a self-contained HTML report.
Behind the scenes, a major component of the differencing process is the alignment of the data frame rows, which is handled by different:::align_data_frames(). Column alignment is exclusively by matching column names.
diff_tbl class
The use of the diff_tbl class is limited primarily to implementing methods for the following generics.
print
print.diff_tbl gives simple output.
summary
summary.diff_tbl gives a more detailed view of the differences between x and y.
plot
plot.diff_tbl gives a visual summary of differences by row, where differences (misses) are encoded in red, and values that exist only in the x or y data frames are colored blue or green.
Helpers
There are also helper methods for as_tibble() (convert to standard tibble) and metadata() (extract diff_meta from the attributes of the diff_tbl object).
Desired API
Missing Pieces
The current workflow requires that rows and columns of the two data frames can be matched easily by matching keys and column names.
Depending on the "distance" between the two data sets, getting both sides to the point where they "snap together" can be a significant data wrangling burden. It would be nice to develop a workflow that assists in this process in a more explicit manner and with the following tasks:
Matching columns
Row alignment
Value normalization
Each of the above steps can be handled with standard dplyr processing, but there are two key points of friction that the different API can help alleviate.
Consolidate the exploratory code into helper functions that run repeated tests across columns, for example in identifying the equivalence maps between values in an x column and a y column. This process will still require manual decision making, but different can streamline the exploration. This is also, in part, where the embedded Shiny app comes into play.
Track all of the dimensions of difference between the two data frames. The pre-processing steps required to reconcile the two data frames are, themselves, part of the differences between the two data frames. But diff_compare() doesn't know about these steps and thus can only report on the differences identified between the processed x and y data.
Ideally, by involving different in the reconciliation process, we can track and report these steps in addition to the final differences between the two objects.
Behind the Scenes
Link the data early in a new diff_pair class. Ideally, the two data frames can be linked together immediately after import. By relying on R's copy-on-modify semantics, we can reference the original data in the new class.
Internally, this class can record the transformations that take place instead of actually performing them, and the transformations can be applied to both (or one) of the original data sets. For printing, the recorded actions can be replayed on the first 10-25 rows rather than the entire data set.
This also means we can consistently expect to be able to reference the original values in reporting and other outputs.
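The record-and-replay idea can be sketched in a few lines of base R. This is only an illustration under assumptions: new_diff_pair(), record_step() and replay() are hypothetical names, not functions in different. The preview is only faithful for row-wise transformations, since a step that filters rows or uses whole-column statistics would behave differently on a subset.

```r
# Hypothetical sketch: none of these helpers exist in `different`.

# Link the two data frames; R's copy-on-modify semantics mean the
# originals are referenced, not duplicated, until something changes them.
new_diff_pair <- function(x, y) {
  stopifnot(is.data.frame(x), is.data.frame(y))
  structure(list(x = x, y = y, steps = list()), class = "diff_pair")
}

# Record a transformation without applying it. `side` says which of
# the original data frames the step targets.
record_step <- function(pair, fn, side = c("both", "x", "y")) {
  side <- match.arg(side)
  pair$steps[[length(pair$steps) + 1]] <- list(fn = fn, side = side)
  pair
}

# Replay the recorded steps on at most the first `n` rows of each side.
replay <- function(pair, n = 10) {
  first_rows <- function(df) df[seq_len(min(n, nrow(df))), , drop = FALSE]
  out <- list(x = first_rows(pair$x), y = first_rows(pair$y))
  for (step in pair$steps) {
    sides <- if (step$side == "both") c("x", "y") else step$side
    for (s in sides) out[[s]] <- step$fn(out[[s]])
  }
  out
}

x <- data.frame(id = 1:3, name = c("a", "b", "c"), stringsAsFactors = FALSE)
y <- data.frame(id = 1:3, name = c("A", "B", "C"), stringsAsFactors = FALSE)

pair <- new_diff_pair(x, y)
pair <- record_step(pair, function(df) {
  df$name <- toupper(df$name)
  df
}, side = "x")

preview <- replay(pair, n = 2)
```

Because the step is only recorded, pair$x still holds the original lower-case values after the call, which is what makes it possible to reference the original data in reports.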
Dispatch on the class of the first argument. Wherever possible, functions in different should be able to be applied to a diff_pair or diff_tbl object OR to an x, y pair of data frames. (If ... is used in a function designed primarily for diff_pair, then the y argument is named .y and needs to be named explicitly.)
Resolving Column Matches
Functions with the diff_cols_ prefix help resolve column matching issues. Column name mismatches are resolved using the function diff_cols_match(z, ..., .y = NULL) to register the final meshed column names.
Transitioning into modifying the values (or mutating the input data), we could also have a function diff_cols_type() that coerces (matched) columns to a particular type. In general, this would be useful only for data types that can be coerced wholesale, e.g. integer to character.
This process could be implicitly incorporated into diff_cols_match() via a .coerce_type = TRUE argument that would coerce matched columns into the same type, using the type of the column with precedence or the column from x.
Finally, the user can also specify that certain columns should be dropped altogether.
Resolving Row Alignment
Row alignment is required for unordered data frames or data frames with differing numbers of rows.
Key variables can be specified in several places:
Note that the keys reference the meshed column names:
tidyselect column name semantics
Column names should match dplyr + tidyselect semantics/feel:
select() style: diff_cols_select(), diff_cols_keys()
rename() style (ish): diff_cols_match()
Value Normalization
This portion of the API should help with two things:
This functionality is differentiated from diff_cols_guess() in the sense that these functions help to resolve encoding issues in columns that are known (or have been confirmed) to be matched.
This provides a list of tibbles with value.x and value.y for matching rows for each column. The user can explore these tibbles, or they can use additional helper functions like diff_values_plot() and diff_values_count().
The value combination counts hint at the equivalence map, which we can get by observing any 1:1, 1:n or n:1 mappings.
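As a rough illustration of how combination counts can expose an equivalence map, the following base R sketch counts each (value.x, value.y) pair for one matched column and keeps the strictly 1:1 pairs. Both count_value_pairs() and one_to_one() are hypothetical helpers written for this example, not part of different, and the example data is made up.

```r
# Hypothetical helpers, not part of `different`.

# Count each (value.x, value.y) combination for one matched column.
count_value_pairs <- function(value_x, value_y) {
  stopifnot(length(value_x) == length(value_y))
  pairs <- data.frame(value.x = as.character(value_x),
                      value.y = as.character(value_y),
                      n = 1L,
                      stringsAsFactors = FALSE)
  counts <- aggregate(n ~ value.x + value.y, data = pairs, FUN = sum)
  counts[order(-counts$n), , drop = FALSE]
}

# Keep combinations where the x value is paired with exactly one y value
# and that y value is paired with exactly one x value.
one_to_one <- function(counts) {
  dup_x <- counts$value.x[duplicated(counts$value.x)]
  dup_y <- counts$value.y[duplicated(counts$value.y)]
  keep <- !(counts$value.x %in% dup_x) & !(counts$value.y %in% dup_y)
  counts[keep, c("value.x", "value.y"), drop = FALSE]
}

value_x <- c("M", "M", "F", "F", "f")
value_y <- c("male", "male", "female", "female", "female")

counts <- count_value_pairs(value_x, value_y)
mapping <- one_to_one(counts)
```

Here "M" and "male" form a 1:1 pair, while "F" and "f" both map to "female" (an n:1 mapping) and are left for the user to resolve. The sketch returns a plain two-column table, which is simpler than the set / value / equivalent layout described in the issue.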
This function returns any 1:1 mappings in a 1:1 or (1 of n):1 format, where set tracks the source data set of value and its equivalent value in the other data set. A potential extension would be to include a .fuzzy argument that could be used for approximate string matching to catch values that are equivalent up to a small typo.
These mappings, once verified, are then applied to the diff_pair:
Overall Workflow
To summarize what an idealized workflow might look like, here's a purely made-up example where some work is required to match columns and values.
I'm not sure how diff_pair and diff_tbl should interact. Should diff_compare() "freeze" the comparison? Or should diff_pair have a diff_tbl slot that it updates on diff_compare()?
One argument for the need for diff_compare() is that it "finalizes" the row-wise alignment. Whereas the diff_pair object doesn't need to have two data frames aligned by rows, the comparison is meaningless under certain conditions if the alignment isn't completed. The user might need to do some wrangling to get the alignment to work (maybe a diff_check_alignment() is needed?), so diff_pair lets the difference comparison be suspended in state until we're ready to run the comparison (like a Schrödinger tibble).
Thoughts and suggestions welcome. cc @tgerke