feat: Table metadata #29

JanKaul · 2023-08-10T15:16:30Z

This PR defines all structures necessary to represent Iceberg Table Metadata. The main focus lies on serialization and deserialization from JSON. Some functionality might need to be added later on.

JanKaul · 2023-08-10T15:16:49Z

I will add some more tests.

Xuanwo

Everything else LGTM, thanks for your hard work.

crates/iceberg/src/spec/mod.rs

liurenjie1024

Great work, thanks @JanKaul. About the integration test: are you planning to add it in this PR or in a following PR?

crates/iceberg/src/spec/schema.rs

crates/iceberg/src/spec/mod.rs

crates/iceberg/src/spec/snapshot.rs

crates/iceberg/src/spec/table_metadata.rs

JanKaul · 2023-08-11T20:41:34Z

By integration tests you mean reading and writing an actual metadata.json file?

liurenjie1024 · 2023-08-12T00:00:49Z

> By integration tests you mean reading and writing an actual metadata.json file?

Yes, I mean the files quoted by @Fokko in #28

JanKaul · 2023-08-12T03:51:14Z

I can include it here. Where should I place the files for testing? Should I create a folder at the workspace level?

liurenjie1024 · 2023-08-12T04:58:26Z

> I can include it here. Where should I place the files for testing? Should I create a folder at the workspace level?

I see other projects usually put a `testdata` folder alongside the `src` folder, something like:

```
crates
....iceberg
........src
........testdata
```

cc @Xuanwo any other suggestions?

liurenjie1024

We are almost there!

crates/iceberg/src/spec/snapshot.rs

crates/iceberg/src/spec/table_metadata.rs

liurenjie1024 · 2023-08-14T12:25:51Z

cc @Fokko PTAL

Fokko · 2023-08-17T09:58:54Z

crates/iceberg/src/spec/schema.rs

+#[derive(Debug, Serialize, Deserialize, PartialEq, Eq, Clone)]
+#[serde(rename_all = "kebab-case")]
+/// Names and types of fields in a table.
+pub(crate) struct SchemaV1 {


Do we really need the SchemaV1? What we do in PyIceberg when reading a V1 table and encountering a schema in the `schema` field:

We just set the schema-id to 0 and add the schema to the `schemas` field. We consider the metadata equivalent.

JanKaul · 2023-08-17

SchemaV1 (and SchemaV2) are internal structs and are not visible to a user of the library. SchemaV1 is only used for serialization/deserialization. We can still do your recommended conversion.

The only publicly visible struct for a schema is the Schema struct, which has the same representation for v1 and v2 tables.
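A minimal sketch of that idea, using only the standard library: two internal structs mirror the v1 and v2 shapes and both convert into one public `Schema`. The field layout is simplified, `DEFAULT_SCHEMA_ID = 0` is an assumption taken from the PyIceberg behaviour described above, and plain `From` is used here where the PR's diff shows `TryFrom`.

```rust
/// Default id used when a v1 schema carries no `schema-id`.
/// Assumption: 0, matching the PyIceberg behaviour described above.
const DEFAULT_SCHEMA_ID: i32 = 0;

/// Public, version-independent schema (simplified to field names only).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Schema {
    pub schema_id: i32,
    pub fields: Vec<String>,
}

/// Internal shape of a v1 schema: the id is optional.
struct SchemaV1 {
    schema_id: Option<i32>,
    fields: Vec<String>,
}

/// Internal shape of a v2 schema: the id is required.
struct SchemaV2 {
    schema_id: i32,
    fields: Vec<String>,
}

impl From<SchemaV1> for Schema {
    fn from(v1: SchemaV1) -> Self {
        Schema {
            // A missing id is replaced by the default on read.
            schema_id: v1.schema_id.unwrap_or(DEFAULT_SCHEMA_ID),
            fields: v1.fields,
        }
    }
}

impl From<SchemaV2> for Schema {
    fn from(v2: SchemaV2) -> Self {
        Schema {
            schema_id: v2.schema_id,
            fields: v2.fields,
        }
    }
}
```

After conversion, callers only ever see `Schema`; the versioned structs can be dropped as soon as the file has been read or written.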

crates/iceberg/src/spec/datatypes.rs

Fokko · 2023-08-17T10:08:10Z

crates/iceberg/src/spec/schema.rs

+    type Error = Error;
+    fn try_from(value: SchemaV1) -> Result<Self> {
+        Schema::builder()
+            .with_schema_id(value.schema_id.unwrap_or(DEFAULT_SCHEMA_ID))


It looks like it isn't set to null?

JanKaul · 2023-08-17

This is called when deserializing a v1 schema into the general Schema struct. If the v1 schema doesn't have a schema id, we assign a default schema_id on read.

liurenjie1024

LGTM, thanks @JanKaul

Fokko · 2023-08-17T10:12:40Z

crates/iceberg/src/spec/snapshot.rs

+#[derive(Debug, Serialize, Deserialize, PartialEq, Eq)]
+#[serde(rename_all = "kebab-case")]
+/// A snapshot represents the state of a table at some time and is used to access the complete set of data files in the table.
+pub(crate) struct SnapshotV2 {


Again, I would combine the V1 and V2. The sequence_number is added later on, and there is some logic to set it afterward:

```python
def _inherit_sequence_number(entry: ManifestEntry, manifest: ManifestFile) -> ManifestEntry:
    """Inherits the sequence numbers.

    More information in the spec: https://iceberg.apache.org/spec/#sequence-number-inheritance

    Args:
        entry: The manifest entry that has null sequence numbers.
        manifest: The manifest that has a sequence number.

    Returns:
        The manifest entry with the sequence numbers set.
    """
    # The snapshot_id is required in V1, inherit with V2 when null
    if entry.snapshot_id is None:
        entry.snapshot_id = manifest.added_snapshot_id

    # in v1 tables, the data sequence number is not persisted and can be safely defaulted to 0
    # in v2 tables, the data sequence number should be inherited iff the entry status is ADDED
    if entry.data_sequence_number is None and (manifest.sequence_number == 0 or entry.status == ManifestEntryStatus.ADDED):
        entry.data_sequence_number = manifest.sequence_number

    # in v1 tables, the file sequence number is not persisted and can be safely defaulted to 0
    # in v2 tables, the file sequence number should be inherited iff the entry status is ADDED
    if entry.file_sequence_number is None and (manifest.sequence_number == 0 or entry.status == ManifestEntryStatus.ADDED):
        # Only available in V2, always 0 in V1
        entry.file_sequence_number = manifest.sequence_number

    return entry
```

This can happen when deserializing the JSON, or later on (like we do in PyIceberg).

liurenjie1024 · 2023-08-18

I don't quite understand; the spec says it's required in v2?

I think the inheritance applies to manifests, but a snapshot should have a sequence number in its JSON?

Fokko · 2023-08-17T10:15:36Z

crates/iceberg/src/spec/snapshot.rs

+            parent_snapshot_id: v2.parent_snapshot_id,
+            sequence_number: v2.sequence_number,
+            timestamp_ms: v2.timestamp_ms,
+            manifest_list: match v2.manifest_list {


Hmm, in PyIceberg we don't check for the manifests field. cc @rdblue

liurenjie1024 · 2023-08-18

The spec says it's required in v2?

Fokko · 2023-08-17T10:20:33Z

crates/iceberg/src/spec/table_metadata.rs

+    }
+
+    #[test]
+    fn test_table_data_v1() {


I would recommend making a very minimal v1 spec, where schema is present, but schemas is missing. Same with partition-spec and partition-specs missing. And for sort-order etc.
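The fallback such a minimal v1 file would exercise can be sketched as follows. This is a hedged illustration using only the standard library: `TableMetadataV1` and `resolve_schemas` are hypothetical names, and the real structs carry many more fields.

```rust
/// Simplified stand-in for a schema; only the id matters here.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Schema {
    schema_id: i32,
}

/// Hypothetical subset of the v1 table-metadata fields.
/// `schema` is the legacy single-schema field, `schemas` the newer list.
struct TableMetadataV1 {
    schema: Option<Schema>,
    schemas: Option<Vec<Schema>>,
}

/// Prefer `schemas`; if it is missing or empty, fall back to the single `schema`.
fn resolve_schemas(meta: TableMetadataV1) -> Result<Vec<Schema>, String> {
    match (meta.schemas, meta.schema) {
        (Some(list), _) if !list.is_empty() => Ok(list),
        (_, Some(single)) => Ok(vec![single]),
        _ => Err("v1 metadata has neither `schemas` nor `schema`".to_string()),
    }
}
```

The same pattern would apply to `partition-spec` versus `partition-specs`; returning an error instead of panicking keeps a malformed file from crashing the reader.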

Fokko · 2023-08-17T10:23:34Z

Left some comments, great work @JanKaul 🚀

liurenjie1024 · 2023-08-17T10:58:14Z

Hi @JanKaul, I would suggest adding integration tests with JSON data in a following PR. This PR is a little too large for me.

liurenjie1024 · 2023-08-18T03:50:14Z

It seems the V1/V2 suffix does not make it clear that these structs exist only to make ser/de easier to write, which caused some misunderstanding for reviewers. I would suggest two improvements:

1. Move the structs used for serde into private modules such as `_serde`.
2. Add comments explaining that they are only for format conversion, not user-facing API.

cc @JanKaul @Fokko

JanKaul · 2023-08-18T04:20:30Z

Good idea, thank you for your great comments!

liurenjie1024 · 2023-08-18T05:45:43Z

To add some background about the design philosophy for reviewers not familiar with Rust:

1. All structs with the suffix V1/V2 exist to make serialization/deserialization easier to maintain. They are something like a handwritten schema definition of the spec; they are discarded after reading from or writing to a file on disk, and they are not user facing. Unlike Java/Python, Rust has no runtime reflection, and the serialization/deserialization code is generated at compile time.
2. About access modifiers: `pub` in Rust is similar to `public` in Java, which means accessible to code outside the package. `pub(crate)` is similar to Java's default access modifier, which is only visible to code in the same package.
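As a rough illustration of the two points above, the layout could look like the sketch below. The module name `_serde` comes from the suggestion earlier in this thread; `from_v1_id` is a hypothetical stand-in for JSON deserialization, and the default id of 0 is an assumption.

```rust
pub mod spec {
    /// User-facing type: `pub`, so code outside the crate can use it.
    #[derive(Debug, PartialEq, Eq)]
    pub struct Schema {
        pub schema_id: i32,
    }

    impl Schema {
        /// Hypothetical entry point standing in for JSON deserialization.
        pub fn from_v1_id(raw_id: Option<i32>) -> Schema {
            _serde::SchemaV1 { schema_id: raw_id }.into()
        }
    }

    /// Private module: nothing in here is part of the public API.
    mod _serde {
        use super::Schema;

        /// `pub(crate)`: visible inside this crate only.
        pub(crate) struct SchemaV1 {
            pub(crate) schema_id: Option<i32>,
        }

        impl From<SchemaV1> for Schema {
            fn from(v1: SchemaV1) -> Self {
                // Assumed default id of 0 when the v1 file has none.
                Schema { schema_id: v1.schema_id.unwrap_or(0) }
            }
        }
    }
}
```

Code in another crate can name `spec::Schema` but cannot reach `spec::_serde::SchemaV1`, which is what keeps the versioned structs out of the user-facing API.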

We have a discussion about the overall structure de# #2 #3

cc @Fokko Hope this comment can help you understand it better.

Fokko

Thanks for working on this @JanKaul, and thanks @liurenjie1024 for teaching me about Rust, appreciate it!

This looks good, thanks!

Fokko · 2023-08-21T08:51:52Z

crates/iceberg/src/spec/table_metadata.rs

+            .with_partition_field(PartitionField {
+                name: "ts_day".to_string(),
+                transform: Transform::Day,
+                source_id: 4,


Idk what the best place for Rust is to do validation, but in this case, source id 4 does not exist in the current schema.

JanKaul · 2023-08-21

Good point, it is probably best to do it during deserialization. We should add it in another PR.
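One possible shape for that check, sketched with the standard library only. `validate_source_ids` is a hypothetical helper, not part of this PR, and `PartitionField` is reduced to the two fields the check needs.

```rust
use std::collections::HashSet;

/// Simplified partition field; names mirror the diff above.
struct PartitionField {
    name: String,
    source_id: i32,
}

/// Returns an error naming the first partition field whose `source_id`
/// is not among the schema's field ids.
fn validate_source_ids(
    schema_field_ids: &[i32],
    fields: &[PartitionField],
) -> Result<(), String> {
    let known: HashSet<i32> = schema_field_ids.iter().copied().collect();
    for field in fields {
        if !known.contains(&field.source_id) {
            return Err(format!(
                "partition field `{}` references unknown source id {}",
                field.name, field.source_id
            ));
        }
    }
    Ok(())
}
```

With a schema whose field ids are 1, 2 and 3, the `ts_day` field from the test above (source id 4) would be rejected.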

JanKaul requested review from Fokko, liurenjie1024 and Xuanwo August 10, 2023 15:17

Xuanwo reviewed Aug 11, 2023

crates/iceberg/src/spec/mod.rs (outdated, resolved)

liurenjie1024 reviewed Aug 11, 2023

crates/iceberg/src/spec/table_metadata.rs (outdated, resolved)

liurenjie1024 reviewed Aug 14, 2023

crates/iceberg/src/spec/snapshot.rs (outdated, resolved)

crates/iceberg/src/spec/table_metadata.rs (outdated, resolved)

JanKaul added 17 commits August 17, 2023 09:05

serde schemav1 & schemav2

73f7d2d

fix default schema id

66a9f27

implement snapshot

df7e60a

add partition spec

4488e65

add license

b61a64c

add sortorder

0e25086

fix initial & write default

ec1dbac

serialize/deserialize table metadata

771d86c

impl table metadata

7ae7374

fix docs

fb65d1f

fix clippy warnings

e3e7a49

change visibility

df190ec

fix rebase

2a12af4

fix clippy warnings

1785d3a

fix transform

4080af5

introduce static

1263f0f

fix typo

86223d8

JanKaul added 3 commits August 17, 2023 09:34

current snapshot returns option

336b8cd

remove panic from snapshot conversion

4c64fc1

check if current_snapshot_id is -1

d6b9179

Fokko reviewed Aug 17, 2023

crates/iceberg/src/spec/datatypes.rs (resolved)

Fokko reviewed Aug 17, 2023

liurenjie1024 approved these changes Aug 17, 2023

Fokko reviewed Aug 17, 2023

fix schema

b8fd0ad

JanKaul force-pushed the table-metadata branch from 6a3044a to b8fd0ad Compare August 17, 2023 12:14

JanKaul added 3 commits August 17, 2023 14:31

use schema field as fallback in v1 table metadata

9e2f07e

use partition spec as fallback in v1 metadata

926ada3

fix parition spec

6b5f7b1

JanKaul added 6 commits August 18, 2023 08:03

introduce _serde module for schema

87b58ad

introduce _serde module for snapshot

5f39039

introduce _serde module for table_metadata

7e3aa59

fix docs

01d5d88

fix typo

236ca40

use minimal table metadata for v1 test

3f29b7b

ZENOTME mentioned this pull request Aug 20, 2023

Read ManifestList, Manifest #36

Closed

Fokko approved these changes Aug 21, 2023


Fokko merged commit bdc66a0 into apache:main Aug 21, 2023
7 checks passed

JanKaul deleted the table-metadata branch March 4, 2024 15:06
