Add ExtensionType trait and CanonicalExtensionType enum #5822

Merged on Feb 2, 2025 (25 commits)

Contributor

@mbrobbel mbrobbel commented May 31, 2024

Rationale for this change

It would be nice to better support reading and writing the Arrow canonical uuid and json extension types with the arrow and parquet crates, i.e. mapping between the Arrow extension types and the Parquet logical uuid and json types.

What changes are included in this PR?

This adds an ExtensionType trait, implementations for the canonical extension types, and a CanonicalExtensionType enum that covers them.

Are there any user-facing changes?

Users can now annotate their fields with extension types, and for uuid and json they are propagated via the arrow writer to map to the parquet uuid and json logical types.

@github-actions github-actions bot added the parquet and arrow labels May 31, 2024
@kylebarron
Contributor

Maybe ExtensionType could be a trait to be externally implementable and not limited to canonical extension types?

Contributor

@tustvold tustvold left a comment

This makes sense to me, and seems like an unobtrusive way to provide better ergonomics for extension types.

That being said, I've had limited exposure to them, so getting some broader perspectives might be valuable, perhaps on the mailing list or something?

@mbrobbel
Contributor Author

I haven't had time to work on this, but I'm planning to pick this up later.

@alamb
Contributor

alamb commented Jul 1, 2024

Thanks @mbrobbel -- marking this PR as draft as I think it still has planned but not yet completed work

@alamb alamb marked this pull request as draft July 1, 2024 19:17
@aykut-bozkurt
Contributor

Thanks for the PR. This seems very useful for supporting logical types that are not yet mapped, e.g. json.

@mbrobbel
Contributor Author

mbrobbel commented Sep 26, 2024

I updated the PR to define a trait for extension types instead. Ready for another round of feedback.
Edit: just realized I need to change some trait methods to make it work for other extension types.

@mbrobbel mbrobbel force-pushed the parquet-uuid-schema branch from 6b883c5 to e35630a Compare January 17, 2025 22:49
@mbrobbel mbrobbel changed the title Add ExtensionType for uuid and map to parquet logical type Add ExtensionType trait and CanonicalExtensionType enum Jan 17, 2025
@mbrobbel mbrobbel marked this pull request as ready for review January 20, 2025 18:59
@mbrobbel
Contributor Author

This is ready for another round of reviews.

I updated the trait and changed some field methods, added implementations for the current set of canonical extension types (behind a feature flag) and added more tests and docs. Maybe this PR is too big now, but I don't mind splitting it up if needed.

Roundtripping through parquet is not implemented in this PR.

@emilk
Contributor

emilk commented Jan 22, 2025

This looks great to me! I really like this approach

Contributor

@jayzhan211 jayzhan211 left a comment

It looks pretty nice!

Contributor

@kylebarron kylebarron left a comment

Oops forgot to commit changes

Comment on lines +2253 to +2258
// TODO: roundtrip
// let arrow_schema = parquet_to_arrow_schema(&parquet_schema, None)?;
// assert_eq!(
//     arrow_schema.field(0).try_extension_type::<Json>()?,
//     Json::default()
// );
Contributor

Is this left for a future PR?

Contributor Author

Yes, this requires a bit more work to propagate correctly.

///
/// // We just return a reference to the Uuid version.
/// fn metadata(&self) -> &Self::Metadata {
/// &self.0
Contributor

My first go at implementing this trait I used:

pub struct PointType(CoordType, Dimension);

impl ExtensionType for PointType {
    const NAME: &'static str = "geoarrow.point";

    type Metadata = (CoordType, Dimension);

    fn metadata(&self) -> &Self::Metadata {
        &(self.0, self.1)
    }
}

But that won't work because the method cannot return a reference to a temporary value.

So essentially type Metadata needs to be the same as whatever is stored inside the extension type struct, and that needs to be a single item, so that metadata() can return &Self::Metadata without creating any new values?

Contributor Author

Yes, the current signature requires that. We could change it to return Self::Metadata?

The pattern I used for the canonical extension types with metadata, applied to your example:

pub struct PointType(PointTypeMetadata);

pub struct PointTypeMetadata(CoordType, Dimension);

impl ExtensionType for PointType {
    const NAME: &'static str = "geoarrow.point";

    type Metadata = PointTypeMetadata;

    fn metadata(&self) -> &Self::Metadata {
        &self.0
    }
}
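
A minimal, self-contained sketch of this wrapper pattern. It assumes a simplified stand-in for the ExtensionType trait (the trait in this PR has more methods, such as metadata serialization) and hypothetical CoordType and Dimension enums that are not taken from geoarrow-rs:

```rust
// Simplified stand-in for the trait discussed in this thread (assumption:
// only the parts needed to show the borrowing pattern are included).
pub trait ExtensionType {
    const NAME: &'static str;
    type Metadata;

    fn metadata(&self) -> &Self::Metadata;
}

// Hypothetical parameter types, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CoordType {
    Interleaved,
    Separated,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Dimension {
    XY,
    XYZ,
}

// All parameters live in one metadata value owned by the extension type,
// so `metadata` can hand out a plain reference to it.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct PointTypeMetadata(pub CoordType, pub Dimension);

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct PointType(PointTypeMetadata);

impl PointType {
    pub fn new(coord_type: CoordType, dimension: Dimension) -> Self {
        Self(PointTypeMetadata(coord_type, dimension))
    }
}

impl ExtensionType for PointType {
    const NAME: &'static str = "geoarrow.point";

    type Metadata = PointTypeMetadata;

    fn metadata(&self) -> &Self::Metadata {
        // Borrowing a field of `self` is fine; building a new tuple here
        // and returning a reference to it would not compile.
        &self.0
    }
}
```

With this layout, `PointType::new(CoordType::Separated, Dimension::XYZ).metadata()` borrows the stored `PointTypeMetadata` instead of a temporary.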

///
/// [`Field`]: crate::Field
/// [`Field::metadata`]: crate::Field::metadata
fn serialize_metadata(&self) -> Option<String>;
Contributor

This is interesting because it requires the type to contain all information to construct its Arrow extension metadata. This is not how I've implemented it so far in geoarrow-rs. For example, I define a Point type as parameterized only by its "coordinate type" (i.e. FixedSizeList coordinates or Struct[x: Float, y: Float] coordinates) and its dimension (XY, XYZ, and so on). Separately, I store more metadata, and each array type stores both pieces of data.

So the data type alone is not enough information to serialize the full GeoArrow metadata, which also needs the Coordinate Reference System (CRS). It's a good question whether the CRS should be a part of the type here. Perhaps it should, but then we need to support CRS equality, which is quite tricky because a single logical CRS can be represented in different ways.

cc @paleolimbot , this might be a question for you.

Contributor Author

Ah this is great, I'll take a better look at this tomorrow and figure out if the current definition works with GeoArrow.
Note that in the spec extension types are defined for Field, not for Array.

Member

I consider the CRS a property of the type in geoarrow-c and geoarrow-python (which is also how it is implemented in the draft Parquet LogicalType, GeoParquet, and the draft Iceberg spec). My reading of this is that it's compatible with GeoArrow (and will be great!). (A Python implementation is here: https://github.com/geoarrow/geoarrow-python/blob/main/geoarrow-types/src/geoarrow/types/type_pyarrow.py ).

It's true that CRS equality is thorny (generally it's expressed as a probability)...I implement this as part of the type and implement == on the type.

Contributor Author

It looks like all the pieces are there (in geoarrow-rs), so I guess it will be straightforward to add ExtensionType implementations that can be used with the added extension type related methods of Field.

@Tpt Tpt left a comment

Just passing by and had a lot of fun reading this code (thank you)!

Please feel free to ignore my comments.

impl ExtensionType for Bool8 {
    const NAME: &'static str = "arrow.bool8";

    type Metadata = &'static str;

Maybe a bad idea: use () here. This way it's not possible to encode a bad metadata value (any string that is not empty).

Contributor Author

With the proposed definition of the trait, () would indicate no metadata and would result in the metadata key being unset in the field's custom metadata map. Note that there is no actual storage for the metadata in this extension type. Both serialization and deserialization make sure the metadata is always an empty string.
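
The distinction between "no metadata" and "empty string metadata" can be sketched as follows. This is a simplified stand-in, not the code from this PR: the trait is reduced to the serialize_metadata method quoted earlier, NoMetadata and field_metadata are hypothetical names, and a plain HashMap stands in for the field's custom metadata map. The two key names are the ones the Arrow format defines for extension types.

```rust
use std::collections::HashMap;

// Reduced stand-in for the trait (assumption: only serialization is modeled).
pub trait ExtensionType {
    const NAME: &'static str;

    /// `None` means the metadata key is left unset; `Some` sets it.
    fn serialize_metadata(&self) -> Option<String>;
}

// Bool8 always serializes its metadata as an empty string.
pub struct Bool8;

impl ExtensionType for Bool8 {
    const NAME: &'static str = "arrow.bool8";

    fn serialize_metadata(&self) -> Option<String> {
        Some(String::new())
    }
}

// Hypothetical extension type without any metadata, for contrast.
pub struct NoMetadata;

impl ExtensionType for NoMetadata {
    const NAME: &'static str = "example.no_metadata";

    fn serialize_metadata(&self) -> Option<String> {
        None
    }
}

// Hypothetical helper that builds the key/value pairs a field would carry.
pub fn field_metadata<E: ExtensionType>(extension_type: &E) -> HashMap<String, String> {
    let mut map = HashMap::new();
    map.insert("ARROW:extension:name".to_owned(), E::NAME.to_owned());
    if let Some(metadata) = extension_type.serialize_metadata() {
        map.insert("ARROW:extension:metadata".to_owned(), metadata);
    }
    map
}
```

In this sketch, `field_metadata(&Bool8)` contains the metadata key with an empty value, while `field_metadata(&NoMetadata)` contains only the name key.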

Ok! Missed that! Thanks!

"" => Ok(""),
_ => Err(ArrowError::InvalidArgumentError(ERR.to_owned())),
},
)

slightly shorter:

Suggested change
)
if metadata.map_or(false, str::is_empty) {
    Ok("")
} else {
    Err(ArrowError::InvalidArgumentError("Bool8 extension type expects an empty string as metadata".into()))
}
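
A standalone sketch comparing the two forms. It assumes the metadata argument is an Option<&str>, uses a plain String error in place of ArrowError, and the function names are hypothetical:

```rust
const ERR: &str = "Bool8 extension type expects an empty string as metadata";

// Match-based check: only `Some("")` is accepted.
pub fn check_with_match(metadata: Option<&str>) -> Result<&'static str, String> {
    match metadata {
        Some("") => Ok(""),
        _ => Err(ERR.to_owned()),
    }
}

// Shorter form from the suggestion: `map_or(false, ...)` treats a missing
// value as invalid and otherwise tests the string for emptiness.
pub fn check_with_map_or(metadata: Option<&str>) -> Result<&'static str, String> {
    if metadata.map_or(false, str::is_empty) {
        Ok("")
    } else {
        Err(ERR.to_owned())
    }
}
```

Both functions accept only `Some("")` and reject `None` as well as any non-empty string, so the choice between them is a matter of style.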

///
/// This type does not have any parameters.
///
/// Metadata is either an empty string or a JSON string with an empty

Based on the implementation, I guess it's not "an empty string" but "a JSON string with an empty string" (i.e. "\"\"" and not "")

Contributor Author

Good catch! 29a94cb

Maybe a bad idea: instead of an Option<Map<...>>, what about just using an is_empty_object: bool field or something like this to make invalid states unrepresentable?

Also, you might even hack something like this to only deserialize the empty object:

#[derive(Deserialize)]
struct Empty {}

let is_empty_struct = serde_json::from_str::<Empty>(value).is_ok();

and then use the Option<Empty> type in JsonMetadata

Contributor Author

I did consider doing this. The spec states (https://arrow.apache.org/docs/format/CanonicalExtensions.html#json):

Description of the serialization:
Metadata is either an empty string or a JSON string with an empty object. In the future, additional fields may be added, but they are not required to interpret the array.

So I figured I'd support the future use case. However, if and when fields are added here in the future, it would actually be better to use a strongly typed definition. I like your suggestion of using Empty here, and since the metadata content is currently not pub, we can just rename and modify the definition of the inner metadata type when fields are added, without introducing a breaking change.

c6f0443
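
The idea of making invalid states unrepresentable can be sketched without serde. This is an illustration under stated assumptions, not the code from the referenced commit: the function names are hypothetical, the empty-object check is a hand-written stand-in for a serde_json deserialization into Empty, and rejecting a missing metadata value is an assumption.

```rust
// The only JSON object this sketch accepts is the empty one.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Empty;

// `None` models the empty string, `Some(Empty)` models "{}".
// No other state can be constructed.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub struct JsonMetadata(Option<Empty>);

// Hand-written stand-in for parsing: true for "{}" with optional whitespace.
fn is_empty_object(value: &str) -> bool {
    value
        .trim()
        .strip_prefix('{')
        .and_then(|rest| rest.strip_suffix('}'))
        .is_some_and(|inner| inner.trim().is_empty())
}

pub fn deserialize_json_metadata(metadata: Option<&str>) -> Result<JsonMetadata, String> {
    match metadata {
        Some("") => Ok(JsonMetadata(None)),
        Some(value) if is_empty_object(value) => Ok(JsonMetadata(Some(Empty))),
        Some(other) => Err(format!("unexpected Json metadata: {other}")),
        // Assumption: a missing metadata value is treated as an error.
        None => Err("missing Json metadata".to_owned()),
    }
}

pub fn serialize_json_metadata(metadata: &JsonMetadata) -> String {
    match metadata.0 {
        None => String::new(),
        Some(Empty) => "{}".to_owned(),
    }
}
```

Because the inner field is private, a later version could swap Empty for a struct with real fields without changing the public surface, which is the point made in the comment above.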

Great! It might be possible to also use a similar approach for the Bool8 metadata.

Contributor Author

There is currently no metadata storage for Bool8, and the difference is that there is no note about future additions to the metadata serialization:

Description of the serialization:
Metadata is an empty string.

So, if there are breaking changes in the spec for this type, it would have to be a breaking change here too anyway.

@alamb
Contributor

alamb commented Jan 24, 2025

I think we should leave this PR open at least through the weekend to gather additional feedback. I think given the size/scope of it perhaps we can plan to merge it after I cut the release candidate for the next arrow-rs release (mid-late next week) if there are no other concerns / comments.

Thank you again @mbrobbel for pushing this along

@alamb
Contributor

alamb commented Jan 31, 2025

I'll merge this PR in the next day or two unless anyone else has any additional comments or wants more time to comment

@alamb
Contributor

alamb commented Feb 2, 2025

Now that we have released 54.1.0, let's do it

🚀 Thanks again @mbrobbel for seeing this through.

@alamb
Contributor

alamb commented Feb 2, 2025

I merged up from main to run the CI once more before merging this

@alamb alamb merged commit 8baaa8b into apache:main Feb 2, 2025
29 checks passed
@alamb
Contributor

alamb commented Feb 2, 2025

Epic work -- can't wait to see how this works in the real world

@alamb alamb added the enhancement label Feb 2, 2025
@mbrobbel mbrobbel deleted the parquet-uuid-schema branch February 2, 2025 12:32
@alamb
Contributor

alamb commented Feb 12, 2025

For some reason this PR isn't showing up in the release notes 🤔

It is in the release though

Labels
arrow, enhancement, parquet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add support for Arrow Extension Types
9 participants