[PATHFINDING] Parse json as variant #7403

Draft
wants to merge 2 commits into
base: main
from

Conversation

scovich
Contributor

@scovich scovich commented Apr 10, 2025

This is a pathfinding exercise, to see how easy/hard it might be to parse JSON text into parquet's new variant type, using the tape decoder. Not intended to merge, it is more of a conversation starter.

In particular:

  • It would be better to leverage a general variant library for variant bit-wrangling instead of doing it all manually here.
  • TBD Where/how to expose this functionality through a public API
  • Still TBD how to assemble a bunch of variant metadata+value pairs inside an arrow array data that can eventually become a usable arrow array
  • For comparison, the same exercise is repeated using serde_json instead. This would almost certainly not belong in an actual contribution to arrow-json.
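As a rough picture of what walking a tape involves, here is a minimal sketch using only the standard library. The `TapeElement` and `Value` types and the `read_value` helper are hypothetical simplifications written for illustration; arrow-json's real tape type is private and differs in detail. The assumed idea is that a tape is a flat sequence of elements in which each container start records the index of its matching end.

```rust
/// Hypothetical, simplified tape element (not arrow-json's actual type).
#[derive(Debug, Clone, PartialEq)]
enum TapeElement {
    /// Start of an object; holds the index of the matching `EndObject`.
    StartObject(usize),
    EndObject,
    /// Start of a list; holds the index of the matching `EndList`.
    StartList(usize),
    EndList,
    String(String),
    Number(f64),
    Bool(bool),
    Null,
}

/// Nested value rebuilt from the flat tape (illustrative only).
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Null,
    Bool(bool),
    Number(f64),
    String(String),
    List(Vec<Value>),
    Object(Vec<(String, Value)>),
}

/// Reads one value starting at `tape[pos]`.
/// Returns the value and the index of the element that follows it,
/// or `None` if the tape is malformed at this position.
fn read_value(tape: &[TapeElement], pos: usize) -> Option<(Value, usize)> {
    match tape.get(pos)? {
        TapeElement::Null => Some((Value::Null, pos + 1)),
        TapeElement::Bool(b) => Some((Value::Bool(*b), pos + 1)),
        TapeElement::Number(n) => Some((Value::Number(*n), pos + 1)),
        TapeElement::String(s) => Some((Value::String(s.clone()), pos + 1)),
        TapeElement::StartList(end) => {
            if *end <= pos {
                return None; // an end index must lie after its start
            }
            let mut items = Vec::new();
            let mut cur = pos + 1;
            while cur < *end {
                let (item, next) = read_value(tape, cur)?;
                items.push(item);
                cur = next;
            }
            Some((Value::List(items), *end + 1))
        }
        TapeElement::StartObject(end) => {
            if *end <= pos {
                return None;
            }
            let mut fields = Vec::new();
            let mut cur = pos + 1;
            while cur < *end {
                // Each field is a string key followed by one value.
                let key = match tape.get(cur)? {
                    TapeElement::String(s) => s.clone(),
                    _ => return None,
                };
                let (value, next) = read_value(tape, cur + 1)?;
                fields.push((key, value));
                cur = next;
            }
            Some((Value::Object(fields), *end + 1))
        }
        // A bare end marker is never the start of a value.
        TapeElement::EndObject | TapeElement::EndList => None,
    }
}
```

A variant encoder could follow the same traversal and emit bytes instead of building a `Value` tree; that part is not modeled here.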

@github-actions github-actions bot added the arrow Changes to the arrow crate label Apr 10, 2025
@scovich
Contributor Author

scovich commented Apr 10, 2025

Attn @alamb

@alamb
Contributor

alamb commented Apr 11, 2025

See also the related PR for variant here:

@alamb
Contributor

alamb commented Apr 11, 2025

Thank you for this PR @scovich

> It would be better to leverage a general variant library for variant bit-wrangling instead of doing it all manually here.

> TBD Where/how to expose this functionality through a public API

In my mind this functionality feels like a "computation kernel" (i.e., similar to the functions in https://docs.rs/arrow/latest/arrow/compute/index.html)

The signature seems like it would roughly be something like:

use arrow::array::ArrayRef;

/// Convert text stored as JSON in an input `StringArray`, `LargeStringArray` or `StringViewArray` into
/// a single "Variant" array (`StructArray` with an extension type)
fn json_to_variant(input: &ArrayRef) -> ArrayRef {
    todo!() // body intentionally left open in this proposal
}

Since the arrow-json crate currently exists to convert JSON to Arrow, it is not 100% clear to me that this functionality belongs in arrow-json at all, especially since variant does not seem to be part of the "core" Arrow spec.

> Still TBD how to assemble a bunch of variant metadata+value pairs inside an arrow array data that can eventually become a usable arrow array

I think we will sort this out as part of implementing variant in #6736. TL;DR: via a StructArray annotated with an extension type, I think.
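To make the metadata+value pairing concrete, here is a minimal standard-library sketch of two variable-length binary columns that share a row index, roughly how the two children of a struct array could line up. `BinaryColumn` and `VariantColumns` are hypothetical stand-ins, not arrow-rs types, and the extension-type annotation is not modeled.

```rust
/// Hypothetical stand-in for a variable-length binary column:
/// all row bytes are concatenated, and `offsets` marks the row boundaries.
struct BinaryColumn {
    offsets: Vec<usize>, // always holds (number of rows + 1) entries
    data: Vec<u8>,
}

impl BinaryColumn {
    fn new() -> Self {
        Self { offsets: vec![0], data: Vec::new() }
    }

    /// Appends one row of bytes.
    fn push(&mut self, bytes: &[u8]) {
        self.data.extend_from_slice(bytes);
        self.offsets.push(self.data.len());
    }

    /// Returns the bytes of `row`, or `None` if the row is out of range.
    fn get(&self, row: usize) -> Option<&[u8]> {
        let start = *self.offsets.get(row)?;
        let end = *self.offsets.get(row + 1)?;
        Some(&self.data[start..end])
    }

    fn len(&self) -> usize {
        self.offsets.len() - 1
    }
}

/// Hypothetical struct-of-two-columns layout: row `i` of `metadata`
/// pairs with row `i` of `value`.
struct VariantColumns {
    metadata: BinaryColumn,
    value: BinaryColumn,
}

impl VariantColumns {
    fn new() -> Self {
        Self { metadata: BinaryColumn::new(), value: BinaryColumn::new() }
    }

    /// Appends one metadata+value pair, keeping both columns the same length.
    fn push(&mut self, metadata: &[u8], value: &[u8]) {
        self.metadata.push(metadata);
        self.value.push(value);
    }

    /// Returns the (metadata, value) pair for `row`, if it exists.
    fn get(&self, row: usize) -> Option<(&[u8], &[u8])> {
        Some((self.metadata.get(row)?, self.value.get(row)?))
    }

    fn len(&self) -> usize {
        self.metadata.len()
    }
}
```

Pushing both halves through a single `push` is one way to keep the two columns from drifting out of step; null rows would need a validity bitmap, which is omitted here.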

@scovich
Contributor Author

scovich commented Apr 16, 2025

> TBD Where/how to expose this functionality through a public API

> In my mind this functionality feels like a "computation kernel" (i.e., similar to the functions in https://docs.rs/arrow/latest/arrow/compute/index.html)

> Since the arrow-json crate is currently for converting JSON to arrow it is not 100% clear to me that this functionality belongs in the arrow-json crate at all, especially as variant is not part of the "core" arrow spec it seems.

I agree something like arrow-compute makes a lot of sense. Unfortunately, the tape decoder machinery is private to the arrow-json crate, so I had to do the initial pathfinding here. Is there a better way forward?

@alamb
Contributor

alamb commented Apr 16, 2025

> I agree something like arrow-compute makes a lot of sense. Unfortunately, the tape decoder machinery is private to the arrow-json crate, so I had to do the initial pathfinding here. Is there a better way forward?

Some other options might be (I'm not sure which one we should go with):

  1. copy/paste the code to avoid a dependency
  2. refactor the tape machinery into a new crate that they can both depend on

I have been thinking a lot about how we should introduce variant. What do you think about a crate structure like this:

  • variant: Core definition of the open variant type, no dependencies
  • arrow-variant: Arrow extension type for variant, including conversion to/from JSON and arrow arrays (e.g. a compute kernel, etc)

Depending on how arrow-variant is implemented, I think it might depend directly on arrow-json, which could then expose the relevant parts.
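The two-layer split can be sketched in a single file using only the standard library. Everything here is hypothetical: `Variant` stands in for a type the dependency-free `variant` crate might define, and this `json_to_variant` is a simplified row-wise kernel shape over plain slices rather than Arrow arrays. The `encode` parameter is a placeholder for a real JSON-to-variant encoder, which is not implemented.

```rust
// --- Would live in the dependency-free `variant` crate (hypothetical) ---

/// One variant value as a metadata/value byte pair; the encoding itself is not modeled.
#[derive(Debug, Clone, PartialEq)]
pub struct Variant {
    pub metadata: Vec<u8>,
    pub value: Vec<u8>,
}

// --- Would live in `arrow-variant`, which depends on `variant` (hypothetical) ---

/// Kernel-shaped conversion: one output row per input row.
/// Null inputs stay null, and the first encoding error aborts the whole call.
pub fn json_to_variant<F>(
    input: &[Option<&str>],
    encode: F,
) -> Result<Vec<Option<Variant>>, String>
where
    F: Fn(&str) -> Result<Variant, String>,
{
    input
        .iter()
        // Option<Result<..>> -> Result<Option<..>> so that nulls pass through untouched.
        .map(|row| (*row).map(&encode).transpose())
        .collect()
}
```

Passing the encoder in as a parameter keeps the sketch independent of where the tape machinery ends up, whether it is copied, moved into a shared crate, or exposed from arrow-json.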

@alamb
Contributor

alamb commented Apr 18, 2025

I filed #7423 to track this item
