Add ExtensionType trait and CanonicalExtensionType enum #5822
Maybe ExtensionType could be a trait to be externally implementable and not limited to canonical extension types?
This makes sense to me, and seems like an unobtrusive way to provide better ergonomics for extension types.
That being said I've limited exposure to them so getting some broader perspectives might be valuable, perhaps on the mailing list or something?
I haven't had time to work on this, but I'm planning to pick this up later.
Thanks @mbrobbel -- marking this PR as draft as I think it still has planned but not yet completed work.
Thanks for the PR. This seems very useful to support not yet mapped logical types, e.g. json.
This is ready for another round of reviews. I updated the trait and changed some field methods, added implementations for the current set of canonical extension types (behind a feature flag) and added more tests and docs. Maybe this PR is too big now, but I don't mind splitting it up if needed. Roundtripping through parquet is not implemented in this PR.
This looks great to me! I really like this approach.
It looks pretty nice!
Oops forgot to commit changes
// TODO: roundtrip
// let arrow_schema = parquet_to_arrow_schema(&parquet_schema, None)?;
// assert_eq!(
//     arrow_schema.field(0).try_extension_type::<Json>()?,
//     Json::default()
// );
Is this left for a future PR?
Yes, this requires a bit more work to propagate correctly.
Filed "Support round tripping extension types in parquet" (#7063) to track this.
///
/// // We just return a reference to the Uuid version.
/// fn metadata(&self) -> &Self::Metadata {
///     &self.0
My first go at implementing this trait I used:
pub struct PointType(CoordType, Dimension);
impl ExtensionType for PointType {
    const NAME: &'static str = "geoarrow.point";
    type Metadata = (CoordType, Dimension);
    fn metadata(&self) -> &Self::Metadata {
        &(self.0, self.1)
    }
}
But that won't work because the compiler rejects it with "cannot return reference to temporary value".
So essentially `type Metadata` needs to be the same as whatever is stored inside the extension type struct, and that needs to be a single item, so that `metadata` can return `&Self::Metadata` without creating any new values?
Yes, the current signature requires that. We could change it to return `Self::Metadata`?
The pattern I used for the canonical extension types with metadata, applied to your example:
pub struct PointType(PointTypeMetadata);
pub struct PointTypeMetadata(CoordType, Dimension);
impl ExtensionType for PointType {
    const NAME: &'static str = "geoarrow.point";
    type Metadata = PointTypeMetadata;
    fn metadata(&self) -> &Self::Metadata {
        &self.0
    }
}
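For readers who want to try the pattern outside the crate, here is a minimal, self-contained sketch. The `ExtensionType` trait below is a cut-down stand-in for the one in this PR (only the items quoted in this thread), and `CoordType` and `Dimension` are hypothetical enums made up for illustration; they are not the geoarrow-rs definitions.

```rust
// Cut-down stand-in for the trait under review. The real trait has more
// items (metadata serialization, data type checks) that are omitted here.
trait ExtensionType {
    const NAME: &'static str;
    type Metadata;
    fn metadata(&self) -> &Self::Metadata;
}

// Hypothetical parameter types, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq)]
enum CoordType {
    Interleaved,
    Separated,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Dimension {
    XY,
    XYZ,
}

// All metadata lives in one owned value...
#[derive(Debug, PartialEq)]
struct PointTypeMetadata(CoordType, Dimension);

// ...which the extension type stores, so `metadata` can borrow from `self`
// instead of building a temporary tuple.
struct PointType(PointTypeMetadata);

impl ExtensionType for PointType {
    const NAME: &'static str = "geoarrow.point";
    type Metadata = PointTypeMetadata;

    fn metadata(&self) -> &Self::Metadata {
        &self.0
    }
}
```

Storing a plain tuple, as in `struct PointType((CoordType, Dimension))`, would satisfy the borrow checker in the same way; the named wrapper mainly leaves room to attach serialization logic to the metadata type.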
///
/// [`Field`]: crate::Field
/// [`Field::metadata`]: crate::Field::metadata
fn serialize_metadata(&self) -> Option<String>;
This is interesting because it requires the type to contain all the information needed to construct its Arrow extension metadata. This is not how I've implemented it so far in geoarrow-rs. For example, I define a `Point` type as parameterized only by its "coordinate type" (i.e. `FixedSizeList` coordinates or `Struct[x: Float, y: Float]` coordinates) and its dimension (`XY`, `XYZ`, and so on). Separately, I store more metadata, and each array type stores both pieces of data.
So the data type alone is not enough information to serialize the full GeoArrow metadata, which also needs the Coordinate Reference System (CRS). It's a good question whether the CRS should be a part of the type here. Perhaps it should, but then we need to support CRS equality, which is quite tricky because a single logical CRS can be represented in different ways.
cc @paleolimbot, this might be a question for you.
Ah this is great, I'll take a better look at this tomorrow and figure out if the current definition works with GeoArrow.
Note that in the spec extension types are defined for Field, not for Array.
I consider the CRS a property of the type in geoarrow-c and geoarrow-python (which is also how it is implemented in the draft Parquet LogicalType, GeoParquet, and the draft Iceberg spec). My reading of this is that it's compatible with GeoArrow (and will be great!). (A Python implementation is here: https://github.com/geoarrow/geoarrow-python/blob/main/geoarrow-types/src/geoarrow/types/type_pyarrow.py.)
It's true that CRS equality is thorny (generally it's expressed as a probability)... I implement this as part of the type and implement `==` on the type.
It looks like all the pieces are there (in `geoarrow-rs`), so I guess it will be straightforward to add `ExtensionType` implementations that can be used with the added extension type related methods of `Field`.
Just passing by and had a lot of fun reading this code (thank you)!
Please just ignore my comments if you feel so.
impl ExtensionType for Bool8 {
    const NAME: &'static str = "arrow.bool8";

    type Metadata = &'static str;
Maybe bad idea: use `()` here. This way it's not possible to encode a bad metadata value (any string that is not empty).
With the proposed definition of the trait, `()` would indicate no metadata and result in the metadata key being unset in the field's custom metadata map. Note that there is no actual storage for the metadata in this extension type. Both serialization and deserialization make sure the metadata is always an empty string.
Ok! Missed that! Thanks!
"" => Ok(""),
_ => Err(ArrowError::InvalidArgumentError(ERR.to_owned())),
},
)
slightly shorter:
if metadata.map_or(false, str::is_empty) {
    Ok("")
} else {
    Err(ArrowError::InvalidArgumentError("Bool8 extension type expects an empty string as metadata".into()))
}
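To make the behaviour of the shorter form concrete, here is a standalone sketch that can be compiled without the arrow crates. `InvalidArgumentError` is a local stand-in for `ArrowError::InvalidArgumentError`, and `deserialize_bool8_metadata` is a hypothetical free function; in the PR this logic sits inside a trait method whose full signature is not shown in this thread. It assumes the metadata arrives as an `Option<&str>`, as the suggested `map_or` call implies.

```rust
// Local stand-in for `ArrowError::InvalidArgumentError`, so the sketch
// needs no external crates.
#[derive(Debug, PartialEq)]
struct InvalidArgumentError(String);

const ERR: &str = "Bool8 extension type expects an empty string as metadata";

// Hypothetical free function wrapping the suggested check.
fn deserialize_bool8_metadata(
    metadata: Option<&str>,
) -> Result<&'static str, InvalidArgumentError> {
    // `Some("")` is the only accepted input. A missing value (`None`) and
    // any non-empty string both fall through to the error branch.
    if metadata.map_or(false, str::is_empty) {
        Ok("")
    } else {
        Err(InvalidArgumentError(ERR.to_owned()))
    }
}
```

Note that `map_or(false, ...)` treats an absent metadata value as an error, not as an empty string.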
///
/// This type does not have any parameters.
///
/// Metadata is either an empty string or a JSON string with an empty
Based on the implementation I guess it's not "an empty string" but "a JSON string with an empty string" (i.e. `"\"\""` and not `""`).
Good catch! 29a94cb
Maybe bad idea: instead of an `Option<Map<...>>`, what about just using an `is_empty_object: bool` field or something like this to make invalid states unrepresentable?
Also, you might even hack something like this to only deserialize the empty object:
#[derive(Deserialize)]
struct Empty {}
let is_empty_struct = serde_json::from_str::<Empty>(value).is_ok();
and then use the `Option<Empty>` type in `JsonMetadata`.
I did consider doing this. The spec states (https://arrow.apache.org/docs/format/CanonicalExtensions.html#json):
Description of the serialization:
Metadata is either an empty string or a JSON string with an empty object. In the future, additional fields may be added, but they are not required to interpret the array.
So I figured I'd support the future use case. However, if fields are added in the future, it would actually be better to use a strongly typed definition. I like your suggestion of using `Empty` here, and since the metadata content is currently not `pub`, if fields are added in the future we can just rename and modify the definition of the inner metadata type without introducing a breaking change.
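A rough, dependency-free sketch of where this discussion lands is shown below. It is not the code from the PR: the `serialize` and `deserialize` methods are hypothetical names, and the hand-rolled parsing stands in for a serde-based implementation so that the example compiles with the standard library only. The point it illustrates is that with `Option<Empty>` the only representable states are the two the spec allows, and the private inner type can gain fields later.

```rust
// Marker for the empty JSON object `{}`. It is private, so fields can be
// added later without changing the public surface.
#[derive(Debug, Default, Clone, PartialEq)]
struct Empty {}

// `None` models the empty string, `Some(Empty {})` models `{}`.
// No other state can be constructed.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct JsonMetadata(Option<Empty>);

impl JsonMetadata {
    // Hypothetical method; a hand-rolled check standing in for a
    // serde-based deserializer.
    pub fn deserialize(s: &str) -> Result<Self, String> {
        if s.is_empty() {
            return Ok(JsonMetadata(None));
        }
        // Accept `{}` with optional whitespace around and inside the braces.
        let inner = s
            .trim()
            .strip_prefix('{')
            .and_then(|rest| rest.strip_suffix('}'));
        match inner {
            Some(body) if body.trim().is_empty() => Ok(JsonMetadata(Some(Empty {}))),
            _ => Err(format!(
                "expected an empty string or an empty JSON object, found {s:?}"
            )),
        }
    }

    // Hypothetical method; writes back one of the two allowed forms.
    pub fn serialize(&self) -> String {
        match &self.0 {
            None => String::new(),
            Some(_) => "{}".to_owned(),
        }
    }
}
```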
Great! It might be possible to also use a similar approach for the `Bool8` metadata.
There is currently no metadata storage for `Bool8`, and the difference is that there is no note about future additions to the metadata serialization:
Description of the serialization:
Metadata is an empty string.
So, if there are breaking changes in the spec for this type, it would have to be a breaking change here too anyway.
I think we should leave this PR open at least through the weekend to gather additional feedback. Given its size and scope, perhaps we can plan to merge it after I cut the release candidate for the next arrow-rs release (mid-late next week) if there are no other concerns or comments. Thank you again @mbrobbel for pushing this along.
I'll merge this PR in the next day or two unless anyone else has any additional comments or wants more time to comment.
Now that we have released 54.1.0, let's do it 🚀 Thanks again @mbrobbel for seeing this through.
I merged up from main to run the CI once more before merging this.
Epic work -- can't wait to see how this works in the real world.
For some reason this PR isn't showing up in the release notes 🤔 It is in the release though.
Rationale for this change
It would be nice to better support reading and writing the Arrow canonical `uuid` and `json` extension types with the arrow and parquet crate, i.e. mapping between the arrow extension type and the parquet logical `uuid` and `json` types.
What changes are included in this PR?
This adds an `ExtensionType` trait, some impls for canonical extension types and a `CanonicalExtensionType` enum for canonical extension types.
Are there any user-facing changes?
Users can now annotate their fields with extension types, and for `uuid` and `json` they are propagated via the arrow writer to map to the parquet `uuid` and `json` logical types.
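As a rough picture of what annotating a field means, the sketch below models a field as a name plus a custom metadata map and writes the two keys the Arrow columnar format reserves for extension types (`ARROW:extension:name` and `ARROW:extension:metadata`). Everything else is a stand-in: the trait is cut down to two items, `Field` and its `annotate` method are hypothetical and are not the arrow-schema API, and the choice of which metadata `Uuid` and `Json` serialize is an assumption made for illustration.

```rust
use std::collections::HashMap;

// Key names reserved by the Arrow columnar format for extension types.
const EXTENSION_TYPE_NAME_KEY: &str = "ARROW:extension:name";
const EXTENSION_TYPE_METADATA_KEY: &str = "ARROW:extension:metadata";

// Cut-down stand-in for the trait added in this PR.
trait ExtensionType {
    const NAME: &'static str;
    // `None` means "no metadata": the metadata key is left unset.
    fn serialize_metadata(&self) -> Option<String>;
}

// Hypothetical minimal field: a name and a custom metadata map.
struct Field {
    name: String,
    metadata: HashMap<String, String>,
}

impl Field {
    fn new(name: &str) -> Self {
        Field { name: name.to_owned(), metadata: HashMap::new() }
    }

    // Hypothetical helper that records the extension type on the field.
    fn annotate<E: ExtensionType>(&mut self, extension_type: &E) {
        self.metadata
            .insert(EXTENSION_TYPE_NAME_KEY.to_owned(), E::NAME.to_owned());
        match extension_type.serialize_metadata() {
            Some(metadata) => {
                self.metadata
                    .insert(EXTENSION_TYPE_METADATA_KEY.to_owned(), metadata);
            }
            None => {
                // Clear any value left over from a previous annotation.
                self.metadata.remove(EXTENSION_TYPE_METADATA_KEY);
            }
        }
    }
}

// Assumption for illustration: uuid carries no metadata.
struct Uuid;
impl ExtensionType for Uuid {
    const NAME: &'static str = "arrow.uuid";
    fn serialize_metadata(&self) -> Option<String> {
        None
    }
}

// Assumption for illustration: json serializes the empty-string form.
struct Json;
impl ExtensionType for Json {
    const NAME: &'static str = "arrow.json";
    fn serialize_metadata(&self) -> Option<String> {
        Some(String::new())
    }
}
```

Because the annotation lives in the field's metadata, it travels with the schema and not with any individual array, which matches the note earlier in the thread that the spec defines extension types for `Field`.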