Skip to content
This repository was archived by the owner on May 17, 2024. It is now read-only.

Json matching & tests for sqeleton PR #15 #383

Merged
merged 4 commits into from
May 5, 2023

Conversation

nicolasaldecoa
Copy link
Contributor

Related to sqeleton PR # 15.

Tests

can someone help me out with the rest of the code needed for the tests to work?
This sqeleton branch has to be installed in order to test this PR.

Json comparison

I also have the following questions about decisions that have to be made regarding false positives/false negatives detection:

  • how can we pass a boolean to diff_sets to indicate whether there are any extra columns of jsontype and avoid checking that every time?

  • what should we do when we detect diffs that are caused by distinct serialization (diffs_are_equiv_jsons returns True) ? IMO the default behavior should be ignoring it, but some users might consider it a difference if the fields don't have exactly the same format (in postrgreSQL I'm pretty sure that this only makes sense with 'json' which keeps the format, but 'bjson' is serialized using a pretty_print function and doesn't care about the original json string that was inserted).

Thanks, hope this helps

@erezsh
Copy link
Contributor

erezsh commented Feb 8, 2023

Hi @nicolasaldecoa ,

The right way to go about it, imho, is to pass set_diff() with a list of the column types that are being diffed. Where it's called, inside _bisect_and_diff_segments(), you can access table1._schema to get them (which should be the same as table2._schema). You can the iterate the column types inside set_diff, and only for columns that are isinstance(..., JsonType) do we need to perform further parsing.

@erezsh
Copy link
Contributor

erezsh commented Feb 8, 2023

As for what to do when fields are the same but have different format, I suggest printing a warning, once per column. (per row would be too much)

@nicolasaldecoa
Copy link
Contributor Author

As for what to do when fields are the same but have different format, I suggest printing a warning, once per column. (per row would be too much)

That's a good idea, I'll implement that.

The right way to go about it, imho, is to pass set_diff() with a list of the column types that are being diffed. Where it's called, inside _bisect_and_diff_segments(), you can access table1._schema to get them (which should be the same as table2._schema). You can the iterate the column types inside set_diff, and only for columns that are isinstance(..., JsonType) do we need to perform further parsing.

great, I should be able to work with that.

If you are OK with it, I'll wait until you have the time to make a commit with what's left in the tests, I got a bit lost on that code.
thanks for the quick responses.

@erezsh
Copy link
Contributor

erezsh commented Feb 8, 2023

until you have the time to make a commit with what's left in the tests

Not sure what you mean?

The way you added the tests seems okay, on a cursory glance

@nicolasaldecoa
Copy link
Contributor Author

nicolasaldecoa commented Feb 8, 2023

Not sure what you mean?
The way you added the tests seems okay, on a cursory glance

Oh, I meant that because I wasn't able to get up the docker compose with the mock dbs locally, I didn't run any tests myself.
I think that _insert_to_table probably needs some custom code for the jsons to be inserted properly, no?
In sql this works:

postgres

INSERT INTO <table> (id, <json_col>) VALUES
       (id, 'json_string')
       [...]

INSERT INTO <table> (id, <jsonb_col>) VALUES
       (id, 'json_string')
       [...]

redshift

INSERT INTO <table> (id, <super_col>) VALUES
       (id, JSON_PARSE('json_string'))
       [...]

@erezsh
Copy link
Contributor

erezsh commented Feb 8, 2023

Ah, got it. But it should work with any database you connect to, doesn't have to use our own dockers. Just look at tests/common.py and change the URLs to whatever you want.

nicolasaldecoa added a commit to nicolasaldecoa/sqeleton that referenced this pull request Feb 8, 2023
@nicolasaldecoa
Copy link
Contributor Author

OK, thank you for the comments. I've updated both PRs (this one and the one in sqeleton; I think this should be ready to be reviewed.
Let me know what you think.
cheers!

@nicolasaldecoa
Copy link
Contributor Author

Hello Devs/Maintainers,
I've been working on a high-level package that uses data-diff under the hood, and the most difficult part has been making cross-database comparisons of json columns work. I'm doing that part as a post-processing step, and so far I've had to add quite a bit of sophistication to the json matching algorithm (recursivity to deal with fields that contain jsons encoded as strings, dealing with the precision of floating point numbers in json fields, etc.).

What this PR does is a start, but it only handles simple objects. I'm not sure if it is worth it to add a feature with these limitations, and otherwise adding a lot of custom logic can be cumbersome. The problem that we've encountered with this data-type is that many times there are differences, but those are just artifacts of how the DBs handle these objects and we just want to ignore them. But that's a bit subjective and can be hard to come up with a solution that works for many different users.

@nicolasaldecoa
Copy link
Contributor Author

nicolasaldecoa commented May 4, 2023

@nolar I forgot to tag you in my last comment. Since this is related to the work you've been doing on PR #533 ; maybe we can close this one and try to handle PG json & jsonb and RS super with your approach instead, because this one is also quite limited.

@nolar nolar merged commit 746b72e into datafold:master May 5, 2023
@nolar
Copy link
Contributor

nolar commented May 5, 2023

The broken tests are to be fixed in #553.

@nicolasaldecoa
Copy link
Contributor Author

The broken tests are to be fixed in #553.

Thank you, I was testing the last commit and realized that I had forgotten to modify the early return in the function when I included the warning-once-per-column feature. Opened a hotfix in #554 . sorry about that

# for free to subscribe to this conversation on GitHub. Already have an account? #.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants