feat: TableMetadata Statistic Files #799

c-thiel · 2024-12-13T17:52:13Z

Adds StatisticsFile and PartitionStatisticsFile to the spec, the builder and the REST TableUpdate.

By far most of the code is tests. I hope that the size of the PR is OK.

c-thiel · 2024-12-13T17:59:26Z

crates/iceberg/src/catalog/mod.rs

+    #[serde(with = "_serde_set_statistics")]
+    SetStatistics {
+        /// File containing the statistics
+        statistics: StatisticsFile,
+    },


In IRC and Java, SetStatistics has an additional field snapshot_id (link).
This field is redundant with StatisticsFile.snapshot_id and is only used as an assertion in Java.

I removed the redundancy for Rust and will start a discussion on the mailing list about how to handle this.

As we still need to be spec compliant, we need a custom serializer / deserializer.

Slack: https://apache-iceberg.slack.com/archives/C03LG1D563F/p1734109745807119

Thanks for raising this on the dev-list: https://lists.apache.org/thread/6fnrp73wokfqlt5vormpjyjmtvl29ll1
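The conversion behind the custom serializer / deserializer discussed in this thread can be sketched without serde as a mapping between a wire shape and an in-memory shape. This is only an illustrative sketch: `WireSetStatistics`, `into_wire`, `from_wire`, and the reduced `StatisticsFile` fields are hypothetical names, not the PR's actual `_serde_set_statistics` module, and rejecting a mismatching `snapshot_id` is an assumption modeled on the Java assertion mentioned above.

```rust
// Hypothetical, reduced stand-in for the real StatisticsFile type.
#[derive(Debug, Clone, PartialEq)]
struct StatisticsFile {
    snapshot_id: i64,
    statistics_path: String,
}

/// Wire shape: carries `snapshot_id` in addition to `statistics.snapshot_id`.
#[derive(Debug, Clone, PartialEq)]
struct WireSetStatistics {
    snapshot_id: i64,
    statistics: StatisticsFile,
}

/// In-memory shape: the redundant field is dropped.
#[derive(Debug, Clone, PartialEq)]
struct SetStatistics {
    statistics: StatisticsFile,
}

impl SetStatistics {
    /// "Serialize": re-create the redundant field from the statistics file.
    fn into_wire(self) -> WireSetStatistics {
        WireSetStatistics {
            snapshot_id: self.statistics.snapshot_id,
            statistics: self.statistics,
        }
    }

    /// "Deserialize": accept the wire shape only if both ids agree
    /// (assumption: a mismatch is treated as an error).
    fn from_wire(wire: WireSetStatistics) -> Result<Self, String> {
        if wire.snapshot_id != wire.statistics.snapshot_id {
            return Err(format!(
                "snapshot id mismatch: {} != {}",
                wire.snapshot_id, wire.statistics.snapshot_id
            ));
        }
        Ok(SetStatistics {
            statistics: wire.statistics,
        })
    }
}
```

With this split, the redundant field exists only at the serialization boundary, so the in-memory update cannot hold two disagreeing snapshot ids.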

c-thiel · 2024-12-14T04:40:18Z

crates/iceberg/src/spec/table_metadata.rs

    /// names in the table, and the map values are snapshot reference objects.
    /// There is always a main branch reference pointing to the current-snapshot-id
    /// even if the refs map is null.
    pub(crate) refs: HashMap<String, SnapshotReference>,
+    /// Mapping of snapshot ids to statistics files.
+    pub(crate) statistics: HashMap<i64, StatisticsFile>,


Can we ever have more than one StatisticsFile for a snapshot_id?
In Java this is modeled as a mapping snapshot_id : Vec<StatisticsFile>; however, I couldn't find a way to get more than one element into the Vec for a snapshot_id other than deserializing.
https://github.com/apache/iceberg/blob/540d6a6251e31b232fe6ed2413680621454d107a/core/src/main/java/org/apache/iceberg/TableMetadata.java#L1310-L1320
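To make the consequence of the `HashMap<i64, StatisticsFile>` choice concrete, here is a minimal sketch using only the standard library. The reduced `StatisticsFile` and the free function `set_statistics` are hypothetical stand-ins for illustration, not the crate's API. Setting statistics twice for the same snapshot id replaces the earlier entry, so at most one file per snapshot is kept.

```rust
use std::collections::HashMap;

// Hypothetical, reduced stand-in for the real StatisticsFile type.
#[derive(Debug, Clone, PartialEq)]
struct StatisticsFile {
    snapshot_id: i64,
    statistics_path: String,
}

/// Stores `file` under its snapshot id and returns the entry it replaced, if any.
fn set_statistics(
    statistics: &mut HashMap<i64, StatisticsFile>,
    file: StatisticsFile,
) -> Option<StatisticsFile> {
    statistics.insert(file.snapshot_id, file)
}
```

A `HashMap<i64, Vec<StatisticsFile>>` would be needed instead if several files per snapshot ever had to be preserved.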

Fokko · 2024-12-16T13:32:34Z

crates/iceberg/src/spec/table_metadata_builder.rs

+    pub fn set_statistics(mut self, statistics: StatisticsFile) -> Self {
+        self.metadata
+            .statistics
+            .insert(statistics.snapshot_id, statistics.clone());


Do we want to check if a snapshot-id exists?

I thought about this as well, but then followed java.

Currently statistics and snapshots are quite separate from each other. If we implement your check (which I like), I think we should eventually also implement:

Upon deserialization, discard statistics that belong to nonexistent snapshots

When a snapshot is removed, delete the statistics for it as well

This would ensure that the snapshot referenced by a statistics file is never missing. It is unclear, however, what should happen to the Puffin files in these cases. We would have coherent metadata, but probably also orphan files.

Do we know why the check is not there in Java?
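For reference, the stricter behavior discussed in this thread could look roughly like the sketch below. It is not what the PR implements (the PR follows Java and performs no such check); `Metadata`, `set_statistics_checked`, `discard_orphan_statistics`, and `remove_snapshot` are hypothetical names, and the sketch only keeps the metadata coherent without touching any Puffin files.

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical, reduced stand-in for the real StatisticsFile type.
#[derive(Debug, Clone, PartialEq)]
struct StatisticsFile {
    snapshot_id: i64,
    statistics_path: String,
}

// Hypothetical, reduced stand-in for table metadata.
#[derive(Debug, Default)]
struct Metadata {
    snapshot_ids: HashSet<i64>,
    statistics: HashMap<i64, StatisticsFile>,
}

impl Metadata {
    /// Rejects statistics whose snapshot is unknown.
    fn set_statistics_checked(&mut self, file: StatisticsFile) -> Result<(), String> {
        if !self.snapshot_ids.contains(&file.snapshot_id) {
            return Err(format!("snapshot {} does not exist", file.snapshot_id));
        }
        self.statistics.insert(file.snapshot_id, file);
        Ok(())
    }

    /// After deserialization: drops statistics that point at missing snapshots.
    fn discard_orphan_statistics(&mut self) {
        let snapshot_ids = &self.snapshot_ids;
        self.statistics.retain(|id, _| snapshot_ids.contains(id));
    }

    /// Removes a snapshot together with its statistics entry, which is returned.
    /// The referenced file itself is not deleted and may become an orphan.
    fn remove_snapshot(&mut self, snapshot_id: i64) -> Option<StatisticsFile> {
        self.snapshot_ids.remove(&snapshot_id);
        self.statistics.remove(&snapshot_id)
    }
}
```

Returning the removed entry from `remove_snapshot` gives the caller a chance to decide what to do with the file it referenced, which is the open question raised above.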

Xuanwo

Thank you @c-thiel for working on this, and thanks to @Fokko for the active review. Let's move!

Statistics

1ef489d

c-thiel mentioned this pull request Dec 13, 2024

feat(puffin): Parse Puffin FileMetadata #765

Merged

License header, rename statistics module

3a7a5f2

c-thiel changed the title ~~feat: TableMetadata Statistics~~ feat: TableMetadata Statistic Files Dec 13, 2024

c-thiel commented Dec 13, 2024


c-thiel commented Dec 14, 2024


TableUpdate RemoveStatistics

71051d2

c-thiel requested review from Fokko and liurenjie1024 December 15, 2024 09:53

Merge remote-tracking branch 'apache/main' into ct/feat-statistics

f1c166c

c-thiel force-pushed the ct/feat-statistics branch from fc26394 to f1c166c Compare December 16, 2024 08:21

Fokko reviewed Dec 16, 2024


Fokko previously approved these changes Dec 16, 2024


fix iter names

c7a2010

c-thiel dismissed Fokko’s stale review via c7a2010 December 16, 2024 14:05

Xuanwo approved these changes Dec 16, 2024


Xuanwo merged commit f00d89b into apache:main Dec 16, 2024
16 checks passed
