Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

better diffs when most chars are ASCII #257

Closed
rogpeppe opened this issue Apr 29, 2021 · 1 comment · Fixed by #258
Closed

better diffs when most chars are ASCII #257

rogpeppe opened this issue Apr 29, 2021 · 1 comment · Fixed by #258

Comments

@rogpeppe
Copy link
Contributor

rogpeppe commented Apr 29, 2021

Here's a recent example diff from an actual test failure:

    to_test.go:773: got "org-4747474747474747,bucket-4242424242424242:m,tag1=a,tag2=aa,\xff=_value _value=2 11\norg-4747474747474747,bucket-4242424242424242:m,tag1=a,tag2=bb,\xff=_value _value=2 21\norg-4747474747474747,bucket-4242424242424242:m,tag1=b,tag2=cc,\xff=_value _value=1 21\norg-4747474747474747,bucket-4242424242424242:m,tag1=a,tag2=dd,\xff=_value _value=3 31\norg-4747474747474747,bucket-4242424242424242:m,tag1=c,\xff=_value _value=4 41\n"
    to_test.go:774: want "org-4747474747474747,bucket-4242424242424242:m,tag1=a,tag2=aa _value=2 11\norg-4747474747474747,bucket-4242424242424242:m,tag1=a,tag2=bb _value=2 21\norg-4747474747474747,bucket-4242424242424242:m,tag1=b,tag2=cc _value=1 21\norg-4747474747474747,bucket-4242424242424242:m,tag1=a,tag2=dd _value=3 31\norg-4747474747474747,bucket-4242424242424242:m,tag1=c _value=4 41\n"
    to_test.go:775: unexpected points result:
          string{
          	... // 29 identical bytes
          	0x32, 0x34, 0x32, 0x34, 0x32, 0x34, 0x32, 0x34, 0x32, 0x34, 0x32, 0x34, 0x32, 0x34, 0x32, 0x3a, //  |242424242424242:|
          	0x6d, 0x2c, 0x74, 0x61, 0x67, 0x31, 0x3d, 0x61, 0x2c, 0x74, 0x61, 0x67, 0x32, 0x3d, 0x61, 0x61, //  |m,tag1=a,tag2=aa|
        - 	0x2c, 0xff, 0x3d,                                                                               // -|,.=|
        + 	0x20,                                                                                           // +| |
          	0x5f, 0x76, 0x61, 0x6c, 0x75, 0x65,                                                             //  |_value|
        - 	0x20, 0x5f, 0x76, 0x61, 0x6c, 0x75, 0x65, 0x3d, 0x32, 0x20,                                     // -| _value=2 |
        + 	0x3d, 0x32, 0x20,                                                                               // +|=2 |
          	0x31, 0x31, 0x0a, 0x6f, 0x72, 0x67, 0x2d, 0x34, 0x37, 0x34, 0x37, 0x34, 0x37, 0x34, 0x37, 0x34, //  |11.org-474747474|
          	0x37, 0x34, 0x37, 0x34, 0x37, 0x34, 0x37, 0x2c, 0x62, 0x75, 0x63, 0x6b, 0x65, 0x74, 0x2d, 0x34, //  |7474747,bucket-4|
          	0x32, 0x34, 0x32, 0x34, 0x32, 0x34, 0x32, 0x34, 0x32, 0x34, 0x32, 0x34, 0x32, 0x34, 0x32, 0x3a, //  |242424242424242:|
          	0x6d, 0x2c, 0x74, 0x61, 0x67, 0x31, 0x3d, 0x61, 0x2c, 0x74, 0x61, 0x67, 0x32, 0x3d, 0x62, 0x62, //  |m,tag1=a,tag2=bb|
        - 	0x2c, 0xff, 0x3d,                                                                               // -|,.=|
        + 	0x20,                                                                                           // +| |
          	0x5f, 0x76, 0x61, 0x6c, 0x75, 0x65,                                                             //  |_value|
        - 	0x20, 0x5f, 0x76, 0x61, 0x6c, 0x75, 0x65, 0x3d, 0x32, 0x20, 0x32,                               // -| _value=2 2|
        + 	0x3d, 0x32, 0x20, 0x32,                                                                         // +|=2 2|
          	0x31, 0x0a, 0x6f, 0x72, 0x67, 0x2d, 0x34, 0x37, 0x34, 0x37, 0x34, 0x37, 0x34, 0x37, 0x34, 0x37, //  |1.org-4747474747|
          	0x34, 0x37, 0x34, 0x37, 0x34, 0x37, 0x2c, 0x62, 0x75, 0x63, 0x6b, 0x65, 0x74, 0x2d, 0x34, 0x32, //  |474747,bucket-42|
          	0x34, 0x32, 0x34, 0x32, 0x34, 0x32, 0x34, 0x32, 0x34, 0x32, 0x34, 0x32, 0x34, 0x32, 0x3a, 0x6d, //  |42424242424242:m|
          	0x2c, 0x74, 0x61, 0x67, 0x31, 0x3d, 0x62, 0x2c, 0x74, 0x61, 0x67, 0x32, 0x3d, 0x63, 0x63,       //  |,tag1=b,tag2=cc|
        - 	0x2c, 0xff, 0x3d, 0x5f, 0x76, 0x61, 0x6c, 0x75, 0x65,                                           // -|,.=_value|
          	0x20, 0x5f, 0x76, 0x61, 0x6c, 0x75, 0x65, 0x3d, 0x31, 0x20, 0x32, 0x31, 0x0a, 0x6f, 0x72, 0x67, //  | _value=1 21.org|
          	0x2d, 0x34, 0x37, 0x34, 0x37, 0x34, 0x37, 0x34, 0x37, 0x34, 0x37, 0x34, 0x37, 0x34, 0x37, 0x34, //  |-474747474747474|
          	0x37, 0x2c, 0x62, 0x75, 0x63, 0x6b, 0x65, 0x74, 0x2d, 0x34, 0x32, 0x34, 0x32, 0x34, 0x32, 0x34, //  |7,bucket-4242424|
          	0x32, 0x34, 0x32, 0x34, 0x32, 0x34, 0x32, 0x34, 0x32, 0x3a, 0x6d, 0x2c, 0x74, 0x61, 0x67, 0x31, //  |242424242:m,tag1|
          	0x3d, 0x61, 0x2c, 0x74, 0x61, 0x67, 0x32,                                                       //  |=a,tag2|
        - 	0x3d, 0x64, 0x64, 0x2c, 0xff, 0x3d, 0x5f, 0x76, 0x61, 0x6c, 0x75, 0x65,                         // -|=dd,.=_value|
        + 	0x3d, 0x64, 0x64,                                                                               // +|=dd|
          	0x20, 0x5f, 0x76, 0x61, 0x6c, 0x75, 0x65, 0x3d, 0x33, 0x20, 0x33, 0x31, 0x0a, 0x6f, 0x72, 0x67, //  | _value=3 31.org|
          	0x2d, 0x34, 0x37, 0x34, 0x37, 0x34, 0x37, 0x34, 0x37, 0x34, 0x37, 0x34, 0x37, 0x34, 0x37, 0x34, //  |-474747474747474|
          	0x37, 0x2c, 0x62, 0x75, 0x63, 0x6b, 0x65, 0x74, 0x2d, 0x34, 0x32, 0x34, 0x32, 0x34, 0x32, 0x34, //  |7,bucket-4242424|
          	0x32, 0x34, 0x32, 0x34, 0x32, 0x34, 0x32, 0x34, 0x32, 0x3a, 0x6d, 0x2c, 0x74, 0x61, 0x67, 0x31, //  |242424242:m,tag1|
        - 	0x3d, 0x63, 0x2c, 0xff, 0x3d, 0x5f, 0x76, 0x61, 0x6c, 0x75, 0x65,                               // -|=c,.=_value|
        + 	0x3d, 0x63,                                                                                     // +|=c|
          	0x20, 0x5f, 0x76, 0x61, 0x6c, 0x75, 0x65, 0x3d, 0x34, 0x20, 0x34, 0x31, 0x0a,                   //  | _value=4 41.|
          }

There are a very few 0xff bytes in there which are causing the whole thing to be rendered as a binary diff,
which is quite a bit harder to read.

Maybe there's a way of formatting diffs somehow that treats the non-ASCII bytes as exceptions rather than the rule.

@rogpeppe rogpeppe changed the title better diffs when most chars are ascii better diffs when most chars are ASCII Apr 29, 2021
@dsnet
Copy link
Collaborator

dsnet commented Apr 29, 2021

Thanks for the report!

At the cost of adding more complexity to the reporter (which I'm fine with, but others seem to not be), the logic can be smarter where it:

  • distinguishes between text and binary based on the ratio of valid to invalid UTF-8 characters,
  • automatically determines the most sensible delimiter (e.g., newline, comma, space, tab, colon, etc.).

I should note that the presence of invalid UTF-8 means that we can't use the triple-quote syntax since the invalid characters would be mangled in possibly indiscernible ways.

dsnet added a commit that referenced this issue Apr 30, 2021
The previous heuristic of treating strings as binary data
if it contains any invalid UTF-8 was too strict.
Loosen the heuristic to check if most of the characters
are printable text.

Fixes #257
dsnet added a commit that referenced this issue May 25, 2021
The previous heuristic of treating strings as binary data
if it contains any invalid UTF-8 was too strict.
Loosen the heuristic to check if most of the characters
are printable text.

Fixes #257
dsnet added a commit that referenced this issue May 25, 2021
The previous heuristic of treating strings as binary data
if it contains any invalid UTF-8 was too strict.
Loosen the heuristic to check if most of the characters
are printable text.

Fixes #257
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants