Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

different: Features and API design #7

Open
gadenbuie opened this issue Aug 17, 2018 · 4 comments
Open

different: Features and API design #7

gadenbuie opened this issue Aug 17, 2018 · 4 comments
Labels
planning 🌱 Development planning
Milestone

Comments

@gadenbuie
Copy link
Member

gadenbuie commented Aug 17, 2018

  • Current State of the diff_tbl class
    • print
    • summary
    • plot
    • Helpers
  • Desired API
    • Missing Pieces
    • Behind the Scenes
    • Resolving Column Matches
    • Resolving Row Alignment
    • tidyselect column name semantics
    • Value Normalization
  • Overall Workflow

Current State

There are two main features of different that have been implemented to date: diff_compare(x, y) generates a diff_tbl object that contains a tidy summary of the differences between x and y, and diff_report(x, y) creats a self contained HTML report.

Behind the scenes, a major component of the differencing process is the alignment of the dataframe rows, which is handled by different:::align_data_frames(). Column alignment is exclusively by matching column names.

diff_tbl class

The use of the diff_tbl class is limited primarily to implementing methods for the following generics.

print

print.diff_tbl gives simple output like

> diff_obj
<diff_tbl: x vs y>There were 156 differences across 8 cols and 86 rows

or

> diff_compare(x, y2)
<diff_tbl: x vs y2>There were no differences found between `x` and `y2`

or

> diff_compare(x, y2)
<diff_tbl: x vs y2>There were no differences in overlapping columns between `x` and `y2`

summary

summary.diff_tbl gives a more detailed view of the differences between x and y.

> diff_compare(x, y) %>% summary()
Output
── different: Comparison Summary ─────────────────────────────────────────
# Dimensions
    set    rows  cols
    ----- ----- -----
    x       100    11
    y       100    10

# Columns
● x has 1 unique column:
    `colname_02`
● There are 10 columns that appear in both
  ✔ 2 cols are identical: 
    `id_01`, `id_02`
  ✖ 8 cols have differences: 
    `colname_01`, `colname_03`, 
    `colname_04`, `colname_05`, `colname_06`, 
    `colname_07`, `colname_09`, 
    `colname_10`

# Differences
❯ There were 156 differences across 8 cols and 86 rows
    variable   type.x    type.y    state    miss_count diff             
    ----      -----     -----     -----         ----- ------           
    colname_01 character character diff             20 <tibble [20 × 5]>
    colname_03 integer   character diff             20 <tibble [20 × 5]>
    colname_04 integer   integer   diff             20 <tibble [20 × 5]>
    colname_05 character character diff             19 <tibble [19 × 5]>
    colname_06 numeric   numeric   diff             20 <tibble [20 × 5]>
    colname_07 character character diff             20 <tibble [20 × 5]>
    colname_09 factor    factor    diff             17 <tibble [17 × 5]>
    colname_10 character character diff             20 <tibble [20 × 5]>
    id_01      integer   integer   same              0 <NULL>           
    id_02      character character same              0 <NULL>           
    colname_02 character <NA>      unique_x         NA <NULL>  

or

> diff_compare(x, y2) %>% summary()
Output
── different: Comparison Summary ─────────────────────────────────────────
# Dimensions
    set    rows  cols
    ----- ----- -----
    x       100    11
    y2      100     9

# Columns
● x has 2 unique columns:
    `colname_04`, `colname_05`
● There are 9 columns that appear in both
  ✔ 9 cols are identical: 
    `colname_01`, `colname_02`, 
    `colname_03`, `colname_06`, `colname_07`, 
    `colname_09`, `colname_10`, `id_01`, 
    `id_02`

# Differences
✔ There were no differences in overlapping columns between `x` and `y2`
    variable   type.x    type.y    state    miss_count diff  
    ----      -----     -----     -----         ----- ------
    colname_01 character character same              0 <NULL>
    colname_02 character character same              0 <NULL>
    colname_03 integer   integer   same              0 <NULL>
    colname_06 numeric   numeric   same              0 <NULL>
    colname_07 character character same              0 <NULL>
    colname_09 factor    factor    same              0 <NULL>
    colname_10 character character same              0 <NULL>
    id_01      integer   integer   same              0 <NULL>
    id_02      character character same              0 <NULL>
    colname_04 integer   <NA>      unique_x         NA <NULL>
    colname_05 character <NA>      unique_x         NA <NULL>

plot

plot.diff_tbl gives a visual summary of differences by row, where differences (misses) are encoded in red, and values that exist only in the x or y data frames are colored blue or green.

Helpers

There are also helper methods for as_tibble() (convert to standard tibble) and metadata() (extract diff_meta from the attributes of the diff_tbl object).

Desired API

Missing Pieces

The current workflow requires that rows and columns of the two data frames can be matched easily by matching keys and column names.

Depending on the "distance" between the two data sets, getting both sides to the point where they "snap together" can be a significant data wrangling burden. It would be nice to develop a workflow that assists in this process in a more explicit manner and with the following tasks:

  1. Matching columns

    • Fuzzy string matching of names may help
    • Matching by distribution of values in the columns
  2. Row alignment

    • Allow user to specify keys on both sides
    • or provide guess for keys?
  3. Value normalization

    • Identify recoding of values in matched columns
    • Identify dropped values or levels

Each of the above steps can be handled with standard dplyr processing, but there are two key points of friction that the different API can help alleviate.

  1. Consolidate the exploratory code into helper functions that run repeated tests across columns, for example in identifying the equivalence maps between values in an x column and a y column. This process will still require manual decision making but different can streamline the exploration. This is also, in part, where the embedded Shiny app comes into play.

  2. Track all of the dimensions of difference between the two data frames. The pre-processing steps required to reconcile the two data frames are, themselves, part of the differences between the two data frames. But diff_compare() doesn't know about these steps and thus can only report on the differences identified between the processed x and y data.

    Ideally, by involving different in the reconciliation process, we can track and report these steps in addition to the final differences between the two objects.

Behind the Scenes

Link the data early in a new diff_pair class. Ideally, the two data frames can be linked together immediately after import. By relying on R's copy-on-modify semantics, we can reference the original data in the new class.

x <- read_csv("data_set_1.csv")
y <- read_tsv("data_set_2.tsv")
z <- diff_pair(x, y)

Internally, this class can record the transformations that take place instead of actually performing them, and the transformations can be applied to both (or one) of the original data sets. For printing, the recorded actions can be replayed on the first 10-25 rows rather than the entire data set.

This also means we can consistently expect to be able to reference the original values in reporting and other outputs.

Dispatch on the method of the first argument. Wherever possible, functions in different should be able to be applied to a diff_pair or diff_tbl class OR to an x, y pair of data frames. (If ... is used in a function designed primarily for diff_pair, then the y argument is named .y and needs to be named explicitly.)

diff_pair(x, y) %>% diff_compare()
#> <diff_tbl: x vs y>
#> ❯ There were 156 differences across 8 cols and 86 rows

diff_compare(x, y)
#> <diff_tbl: x vs y>
#> ❯ There were 156 differences across 8 cols and 86 rows

Resolving Column Matches

Functions with the diff_cols_ prefix help resolve column matching issues. Column name mismatches are resolved using the function diff_cols_match(z, ..., .y = NULL) to register the final meshed column names.

z <- diff_pair(x, y)

diff_cols_common(z)
#> `colname_01`, `colname_02`, ...

diff_cols_unique(z)
#> x has 1 unique column:
#>   `colname_03`
#> y has 1 unique column:
#>   `colname_04`

diff_cols_fuzzy(z)
#> Fuzzy string matching suggests x -> y column matches:
#>  "colname_05" = "colname 05",
#>  "column 6" = c("column_06", "cols 6")

# ...above output could also be a tibble

diff_cols_guess(z)
#> ...returns a tibble with columns:
#> - x: column name from x
#> - best_match: column name from y
#> - match_reason:
#>     - exact name
#>     - value test
#>     - fuzzy name match

z <- diff_cols_match(z,
  colname_05 = c("column 05", "colname 05"),         # rename mismatched columns
  colname_05 = c(x = "column 05", y = "colname 05"), # ...same as above
  c(y = "cols 6", x = "column 6"),                   # let order give precedence
  "cols 6" = c("column 6", "cols 6"),                # ...same as above
  colname_new_04 = "colname_04"         # rename columns from x,y simultaneously
)

# If no parameters given to diff_cols_match(), 
# use .method = c("strict", "fuzzy", "guess")
z <- diff_cols_match(z, .method = "guess")
#> ...print messages about how matches were guessed...

Transitioning into modifying the values (or mutating the input data), we could also have a function diff_cols_type() that coerces (matched) columns to a particular type. In general, this would be useful only for data types that can be whole-sale coerced, e.g. integer to character.

z <- diff_cols_type(z, "colname_05" = "double", "colname_06" = "character")

This process could be implicitly incorporated into diff_cols_match() via a .coerce_type = TRUE argument that would coerce matched columns into the same type, using the type of the column with precedence or the column from x.

Finally, the user can also specifiy that certain columns should be dropped altogether.

# diff_cols_select <- function(z, ..., .which = c("both", "x", "y"))
z <- diff_cols_select(z, starts_with("colname"))

Resolving Row Alignment

Row alignment is required for unordered data frames or data frames with differeing numbers of rows.

Key variables can be specified in several places:

z <- diff_pair(x, y, keys = c("id_01", "id_02"))

z <- diff_cols_keys(z, id_01, id_02)

# Or use left = right like in `full_join()`, left takes precedence
z <- diff_cols_keys(z, id_11 = id_01, id_02)

diff_compare(z, keys = c("id_01"))

Note that the keys reference the meshed column names:

diff_pair(x, y, keys = "id_01") %>%
  diff_cols_match(id_99 = "id_01")
#> Error: key `id_01` not found in `x` or `y`. Did you mean `id_99`?

diff_pair(x, y, keys = "id_99") %>%
  diff_cols_match(id_99 = "id_01")
#> will work!

tidyselect column name semantics

Column names should match dplyr + tidyselect semantics/feel:

  • select() style:
    • diff_cols_select()
    • diff_cols_keys()
  • rename() style (ish):
    • diff_cols_match()

Value Normalization

This portion of the API should help with two things:

  • surfacing equivalence maps between two columns in the presence of 1:1 or minor modifications
  • modifying the values in columns of one or both data sets according to the discovered mappings.

This functionality is differentiated from diff_cols_guess() in the sense that these functions help to resolve encoding issues in columns that are known (or have been confirmed) to be matched.

diff_values(z)
#> $colname_01
#> # A tibble: 6 x 5
#>   variable   value.x value.y id_01 id_02 
#>   <chr>      <chr>   <chr>   <int> <chr> 
#> 1 colname_01 xikarq  xikarq     77 ldhuft
#> 2 colname_01 qdoyuv  qdoyuv      2 gkmols
#> 3 colname_01 zapizm  zapizm     53 izygxg
#> 4 colname_01 aknhkr  aknhkr     24 pmxpow
#> 5 colname_01 zamsxb  zamsxb      9 hpwqhh
#> 6 colname_01 hhgfxp  hhgfxp     15 dtjden
#> 
#> $colname_03
#> # A tibble: 6 x 5
#>   variable   value.x value.y id_01 id_02 
#>   <chr>      <chr>   <chr>   <int> <chr> 
#> 1 colname_03 -24     -8         77 ldhuft
#> 2 colname_03 -8      -8          2 gkmols
#> 3 colname_03 -19     6          53 izygxg
#> 4 colname_03 26      26         24 pmxpow
#> 5 colname_03 -46     19          9 hpwqhh
#> 6 colname_03 14      14         15 dtjden

# or
diff_values(z, colname_03)
#> # A tibble: 6 x 5
#>   variable   value.x value.y id_01 id_02 
#>   <chr>      <chr>   <chr>   <int> <chr> 
#> 1 colname_03 -24     -8         77 ldhuft
#> 2 colname_03 -8      -8          2 gkmols
#> ...3

This provides a list of tibbles with value.x and value.y for matching rows for each column. The user can explore these tibbles, or they can use additional helper functions like diff_values_plot() and diff_values_count().

diff_values_plot(z)

#...same as
diff_values(z) %>% 
  imap(~ ggplot(.x) + aes(value.x, value.y) + geom_point() + ggtitle(.y)) %>% 
  cowplot::plot_grid(plotlist = .)

diff_values_count(z, "colname_13")
#> $colname_13
#> # A tibble: 4 x 3
#>   value.x value.y     n
#>     <int> <chr>   <int>
#> 1       1 a          26
#> 2       2 b          31
#> 3       3 c          16
#> 4       4 d          27

The value combination counts hint at the equivalence map, which we can get by observing any 1:1, 1:n or n:1 mappings.

diff_values_equivalent(z, "colname_13")
#> $colname_13
#> A tibble: 4 x 3
#>   set   value equivalent
#>   <chr> <int> <chr>     
#> 1 x         1 a         
#> 2 x         2 b         
#> 3 x         3 c         
#> 4 x         4 d

This function returns any 1:1 mappings in a 1:1 or (1 of n):1 format, where set tracks the source data set of value and its equivalent value in the other data set. A potential extension would be to include a .fuzzy argument that could be used for approximate string matching to catch values that are equivalent up to a small typo.

These mappings, once verified, are then applied to the diff_pair:

# ---- Columns from one side, pre column matching ----
# re-map values for the `x` set
diff_values_mutate(z, colname_13 = letters[colname_13], .which = "x")

# or for y
diff_values_mutate(z, colname_13 = which(letters == colname_13), .which = "y")

# ---- Both columns, post column matching ----
diff_values_mutate(z, vital_status = as.integer(vital_status == "ALIVE"))

# ---- Automatic mapping ----
diff_values_mutate(z, .map = diff_values_equivalent(z, "colname_13))

Overall Workflow

To summarize what an idealized workflow might look like, here's a purely made-up example where some work is required to match columns and values.

x <- read_csv("data_set_1.csv")
y <- read_tsv("data_set_2.tsv")
z <- diff_pair(x, y)

# Check columns
diff_cols_unique(z)
diff_cols_guess(z)

z <- diff_cols_match(z
  colname_05 = c("Column 5", "colname 05"),
  colname_06 = c(y = "cols 6", x = "column 6"),
  colname_new_04 = "colname_04"
) %>%
  diff_cols_select(contains("id|colname")) %>%
  diff_cols_keys(id_01, id_02)

# Check values
diff_values_plot(z)
diff_values_equivalent(z, colname_13)

z <- z %>%
  diff_values_mutate(colname_04 = recode(colname_04, "old_x" = "new"), .which = "x") %>%
  diff_values_mutate(colname_04 = recode(colname_04, "old_y" = "new"), .which = "y") %>%
  diff_values_mutate(colname_06 = colname_06 == "YES")

# printing a diff_pair object runs a truncated version of diff_compare
z
#> <diff_tbl - preview: x vs y>
#> ❯ There were 73 differences across 8 cols and 21 rows in the first 100 rows

diff_x_y <- diff_compare(z)
diff_x_y
#> <diff_tbl: x vs y>
#> ❯ There were 156 differences across 8 cols and 86 rows

I'm not sure about how diff_pair and diff_tbl should interact. Should diff_compare() "freeze" the comparison? Or should diff_pair have a diff_tbl slot that it updates on diff_compare()?

One argument for the need for diff_compare() is that it "finalizes" the row-wise alignment. Whereas the diff_pair object doesn't need to have two data frames aligned by rows, the comparison is meaningless under certain conditions if the alignment isn't completed. The user might need to do some wrangling to get the alignment to work -- maybe a diff_check_alignment() is needed? -- so diff_pair lets the difference comparison be suspended in state until we're ready to run the comparison (like a Schrodinger tibble).

Thoughts and suggestions welcome cc @tgerke

@gadenbuie gadenbuie added the planning 🌱 Development planning label Aug 17, 2018
@gadenbuie gadenbuie added this to the v0.1 milestone Aug 17, 2018
@tgerke
Copy link
Member

tgerke commented Aug 24, 2018

  • Would an option for row/column alignment by ordering (rather than by colname matching, for example) ever be useful? I'm thinking the output could be helpful even in the context of simple matrix objects, not just tibbles.
  • In output for summary might be good to indicate "Entries in 2 cols are identical". There are a lot of things being compared (dimensions, column/row names, types) in addition to actual values, so probably good to be clear.
  • I have trouble figuring out what the variable miss_count is supposed to represent, though I suspect "miss" is just being used as a synonym for "difference." If that's the case, what about diff_count instead?

@gadenbuie
Copy link
Member Author

  • Handle matrix inputs as well as tibbles (colnames will be X1, X2 style and can use rownames if available). Will need to add a guard function that promote matrices to tibbles.

  • When IDs are given for the row alignment, checks occur to make sure these are not duplicated in either data frame. These error messages should be as helpful as possible, possibly even suggesting the right filter(x, ...) code to find the duplicated rows.

  • Additional column matching ideas:

    • include checks on actual value equivalence for a small sample of values
    • matching frequency tables for categorical variables
    • equal summary statistics (min, q0.25, q0.5, q0.75, max)
    • equal frequency of binned values (numeric values) geom_boxplot() style
  • diff_cols_match() should be able to also work with a tibble of x, y, matched column names

  • An additional output, ancillary to diff_compare() would be the dplyr code to process x to match y and y to match x. Or to just output the matched tibbles as they would be just prior to diff_compare().

@gadenbuie
Copy link
Member Author

gadenbuie commented Aug 31, 2018

  • diff_pair()
  • diff_cols_common()
  • diff_cols_unique()
  • diff_cols_type()
  • diff_cols_guess()
  • diff_cols_match()
  • Add diff_pair.list() method that takes list of length 2 and creates diff_pair and uses list names if available
  • Add diff_pair.matrix() method
  • diff_cols_unique() and diff_cols_common() need better logic when no unique/common columns
  • Fix 1 column have differences: in summary printout
  • Add input object classes to metadata
  • Add diff_pull() to pull out variables from a diff_pair or diff_tbl.
  • Add diff_plot_variable() to plot values of x and y data as points with x values on x-axis, etc.

@gadenbuie
Copy link
Member Author

What about remote tables? Like using dbplyr to compare two tables living in separate remote databases.

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
planning 🌱 Development planning
Projects
None yet
Development

No branches or pull requests

2 participants