Json extract #524

Merged
merged 9 commits into from
Jan 3, 2025

madejejej
Contributor

@madejejej madejejej commented Dec 20, 2024

Implements the json_extract function.

In the meantime, JSON path parsing, which json_extract requires, has already been implemented by @petersooley in #555.

However, this PR takes a different approach and parses the JSON path using the JSON grammar, because there are a lot of quirks in how a JSON key can look (see the JSON grammar in the Pest file).

The downside is that it allocates more memory than the current implementation, but might be easier to maintain in the long run.

I included a lot of tests covering some quirky behavior of json_extract (some of them still need work). I also noticed that this behavior changed between SQLite versions (I had SQLite 3.43.2 locally, and 3.45 gave different results). Because of this, I'm not sure how much value there is in trying to be fully compatible with SQLite. Perhaps the approach taken by @petersooley solves 99% of use cases?

@madejejej
Contributor Author

I hoped this feature wouldn't drop me into a rabbit hole 😅 Unfortunately, there are some quirks in the JSON syntax, e.g.:

-- "\x61" is an escape sequence for the ASCII character 'a', but it matches in neither SQLite nor Postgres:
SELECT json_extract('{"\x61": 1}', 'a');

-- you need to use the exact sequence of characters:
SELECT json_extract('{"\x61": 1}', '\x61')
1

-- the other way around also wouldn't work:
SELECT json_extract('{"a": 1}', '\x61')
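
The behavior in the queries above is consistent with comparing keys by their raw characters, with no escape decoding on either side. A minimal Rust sketch of that idea follows; `raw_key_matches` is a hypothetical helper written for illustration, not a function from this PR or from SQLite.

```rust
/// Hypothetical helper (not part of this PR): compares an object key with a
/// path key exactly as written, without decoding escapes such as `\x61`.
fn raw_key_matches(object_key: &str, path_key: &str) -> bool {
    // Assumption: keys are compared character by character in raw form,
    // so the four characters `\x61` never equal the single character `a`.
    object_key == path_key
}
```

In Rust source, `"\\x61"` is the four characters `\`, `x`, `6`, `1`, so `raw_key_matches("\\x61", "\\x61")` holds while `raw_key_matches("\\x61", "a")` and `raw_key_matches("a", "\\x61")` do not.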


#[derive(Parser)]
#[grammar_inline = r#"
array_locator = @{ "[" ~ array_index ~ "]" }
Contributor Author

I had to copy a large part of grammar from core/json/de.rs. Perhaps I could extract the grammar from there into a constant and add additional rules for parsing the path? However, this would leave us with a few unused rules.

The other way to DRY it out would be to extract only the common rules, but I feel like it's brittle and very unhandy if we ever have to patch the JSON grammar.

@jussisaurio
Collaborator

i merged #504 so base is now main

@madejejej madejejej force-pushed the json-extract branch 2 times, most recently from cc089af to 332b831 Compare December 31, 2024 08:40
@madejejej
Contributor Author

@petersooley I saw your PR with json_array_length got merged while I was working on json_extract.

I like the idea of a smaller, hand-rolled parser but I also think the JSON grammar is pretty complicated, even in terms of what you can do with the JSON path in SQLite (see some tests attached). Let me know your thoughts.

@madejejej madejejej force-pushed the json-extract branch 3 times, most recently from 0b639f4 to d970d00 Compare December 31, 2024 08:52
@madejejej
Contributor Author

Test FAILED: 'SELECT json_extract(1, null, null, null)'
returned ''
expected '[null,null,null]'

SQLite seems to be changing some quirky behavior from version to version:

sqlite> .version
SQLite 3.43.2 2023-10-10 13:08:14 1b37c146ee9ebb7acd0160c0ab1fd11017a419fa8a3187386ed8cb32b709aapl
zlib version 1.2.12
clang-15.0.0 (64-bit)
sqlite> SELECT json_extract(1, null, null, null);
[null,null,null]

@madejejej madejejej marked this pull request as ready for review December 31, 2024 09:00
@petersooley
Contributor

> @petersooley I saw your PR with json_array_length got merged while I was working on json_extract.
>
> I like the idea of a smaller, hand-rolled parser but I also think the JSON grammar is pretty complicated, even in terms of what you can do with the JSON path in SQLite (see some tests attached). Let me know your thoughts.

@madejejej The path parsing is much simpler than the JSON parsing, for sure. It's also very limited in sqlite (i.e. no glob/wildcard patterns). It's mostly array indexes and object property paths with a few extra cases.

What I like about the hand-rolled solution is that it doesn't separate out the path parsing from accessing the value in the JSON. That allows returning early as soon as the JSON value has no match for the path. No matter which way we go, there's always a loop required to drill into the JSON and extract a value at the end of the given path. Both solutions are doing that loop anyway, it's just that the hand-rolled solution is doing it during path parsing.
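
To make the early-return argument concrete, here is a rough sketch of a walker that consumes the path and descends into the JSON value in the same loop. Everything in it is an assumption made for illustration: the `Json` enum and the `extract` function are hypothetical and are not the types used in this repository, and the path syntax is reduced to `$`, `.key`, and `[index]` (no quoted keys, no `#-1` style indexes, and malformed paths simply yield `None` instead of an error).

```rust
/// Hypothetical, minimal JSON value type used only for this sketch.
#[derive(Debug, Clone, PartialEq)]
enum Json {
    Null,
    Number(f64),
    Str(String),
    Array(Vec<Json>),
    Object(Vec<(String, Json)>),
}

/// Walks `root` while scanning `path`, stopping at the first step that has
/// no match. Simplified path syntax: `$`, then any mix of `.key` and `[n]`.
fn extract<'a>(root: &'a Json, path: &str) -> Option<&'a Json> {
    let mut rest = path.strip_prefix('$')?;
    let mut current = root;
    while !rest.is_empty() {
        if let Some(after_dot) = rest.strip_prefix('.') {
            // A bare key runs until the next '.' or '['.
            let end = after_dot
                .find(|c: char| c == '.' || c == '[')
                .unwrap_or(after_dot.len());
            let key = &after_dot[..end];
            if key.is_empty() {
                return None;
            }
            let Json::Object(members) = current else { return None };
            // Early return: the rest of the path is never scanned on a miss.
            current = members
                .iter()
                .find(|(k, _)| k.as_str() == key)
                .map(|(_, v)| v)?;
            rest = &after_dot[end..];
        } else if let Some(after_bracket) = rest.strip_prefix('[') {
            let end = after_bracket.find(']')?;
            let index: usize = after_bracket[..end].parse().ok()?;
            let Json::Array(items) = current else { return None };
            current = items.get(index)?;
            rest = &after_bracket[end + 1..];
        } else {
            return None;
        }
    }
    Some(current)
}
```

With a document equivalent to `{"a": [1, "x"]}`, `extract(&doc, "$.a[1]")` yields the string element, while `extract(&doc, "$.b.c[0]")` gives up at `.b` without looking at `.c[0]`.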

@jussisaurio
Collaborator

>> @petersooley I saw your PR with json_array_length got merged while I was working on json_extract.
>>
>> I like the idea of a smaller, hand-rolled parser but I also think the JSON grammar is pretty complicated, even in terms of what you can do with the JSON path in SQLite (see some tests attached). Let me know your thoughts.
>
> @madejejej The path parsing is much simpler than the JSON parsing, for sure. It's also very limited in sqlite (i.e. no glob/wildcard patterns). It's mostly array indexes and object property paths with a few extra cases.
>
> What I like about the hand-rolled solution is that it doesn't separate out the path parsing from accessing the value in the JSON. That allows returning early as soon as the JSON value has no match for the path. No matter which way we go, there's always a loop required to drill into the JSON and extract a value at the end of the given path. Both solutions are doing that loop anyway, it's just that the hand-rolled solution is doing it during path parsing.

Yeah this is a decent point, we'd save some deserialization overhead by traversing the JSON object on demand while parsing the path. Maybe not the most important thing in the world and can be optimized later, but it's also nice to implement things right the first time around. I wouldn't block this PR from going forward with the eagerly-parsed version so I'll let @madejejej decide

@madejejej
Contributor Author

> What I like about the hand-rolled solution is that it doesn't separate out the path parsing from accessing the value in the JSON. That allows returning early as soon as the JSON value has no match for the path

I agree that it feels better. However we should be able to parse any valid JSON key, which in the JSON grammar is defined as:

key = _{ identifier | string }

The json.pest file has a lot of grammar to define the identifier and the string, which makes me think the hand-rolled parser might be harder to maintain and possibly more buggy over time.
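
As a rough illustration of why even the key rule takes some care by hand, the sketch below implements a heavily simplified stand-in for `key = _{ identifier | string }`. The function `parse_key` is hypothetical, and its rules are assumptions: identifiers are limited to ASCII letters, digits, `_` and `$` (not starting with a digit), and strings are double-quoted only, with backslash escapes skipped over but left undecoded. The real grammar in json.pest covers more cases than this.

```rust
/// Hypothetical, simplified stand-in for the `key = _{ identifier | string }`
/// rule. Returns the key text and the unparsed remainder, or `None` if no
/// key starts at the beginning of `input`.
fn parse_key(input: &str) -> Option<(&str, &str)> {
    if let Some(body) = input.strip_prefix('"') {
        // String alternative: scan for the closing quote, skipping any
        // character that follows a backslash. Escapes are kept raw.
        let mut escaped = false;
        for (i, c) in body.char_indices() {
            match c {
                _ if escaped => escaped = false,
                '\\' => escaped = true,
                '"' => return Some((&body[..i], &body[i + 1..])),
                _ => {}
            }
        }
        None // unterminated string
    } else {
        // Identifier alternative (assumed ASCII-only character set).
        let end = input
            .find(|c: char| !(c.is_ascii_alphanumeric() || c == '_' || c == '$'))
            .unwrap_or(input.len());
        let (key, rest) = input.split_at(end);
        if key.is_empty() || key.starts_with(|c: char| c.is_ascii_digit()) {
            None
        } else {
            Some((key, rest))
        }
    }
}
```

Even this reduced version needs separate handling for quoting, escapes, and identifier start characters; Unicode identifiers and the other string forms in the grammar would add further branches.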

@madejejej madejejej requested a review from jussisaurio January 2, 2025 08:19
Collaborator

@jussisaurio jussisaurio left a comment

fine to roll with this now, we can optimize the impl later with hand-rolled parsing if it really becomes necessary

@jussisaurio jussisaurio merged commit a934ead into tursodatabase:main Jan 3, 2025
36 checks passed
@madejejej madejejej mentioned this pull request Jan 8, 2025