Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

feat: TableMetadata Statistic Files #799

Merged
merged 5 commits into from
Dec 16, 2024
Merged

Conversation

c-thiel
Copy link
Collaborator

@c-thiel c-thiel commented Dec 13, 2024

Adds StatisticFile and PartitionStatisticsFile to spec, builder and REST TableUpdate.

By far most of the code are tests. I hope that the size of the PR is OK.

@c-thiel c-thiel changed the title feat: TableMetadata Statistics feat: TableMetadata Statistic Files Dec 13, 2024
Comment on lines +450 to +454
#[serde(with = "_serde_set_statistics")]
SetStatistics {
/// File containing the statistics
statistics: StatisticsFile,
},
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In IRC and Java SetStatistics has an additional field snapshot_id (link).
This field is redundant with StatisticsFile.snapshot_id and is only used as an assertion in Java.

I removed the redundancy for rust and will start a discussion on the mailing List how to handle this.

As we still need to be spec compliant, we need custom serializer / deserializer.

Slack: https://apache-iceberg.slack.com/archives/C03LG1D563F/p1734109745807119

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for raising this on the dev-list: https://lists.apache.org/thread/6fnrp73wokfqlt5vormpjyjmtvl29ll1

/// names in the table, and the map values are snapshot reference objects.
/// There is always a main branch reference pointing to the current-snapshot-id
/// even if the refs map is null.
pub(crate) refs: HashMap<String, SnapshotReference>,
/// Mapping of snapshot ids to statistics files.
pub(crate) statistics: HashMap<i64, StatisticsFile>,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we ever have more than one StatisticsFile for a snapshot_id?
In java this is modeled as a mapping snapshot_id : Vec<StatisticsFile>, however I couldn't find a way to get more than one element into the Vec for a snapshot_id other than deserializing.
https://github.com/apache/iceberg/blob/540d6a6251e31b232fe6ed2413680621454d107a/core/src/main/java/org/apache/iceberg/TableMetadata.java#L1310-L1320

pub fn set_statistics(mut self, statistics: StatisticsFile) -> Self {
self.metadata
.statistics
.insert(statistics.snapshot_id, statistics.clone());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to check if a snapshot-id exists?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about this as well, but then followed java.

Currently statistics and snapshots are quite separate from each other. If we implement your check (which I like), I think we should eventually also implement:

  • Upon deserialization discard statistics that belong to nonexistant snapshots
  • When a snapshot is removed delete the statistics for it as well

This would result in snapshots for statistics not missing. It is unclear however what should happen to the puffin files in these cases. We would have coherent metadata, but probably also orphan files.

Do we know why the check is not there in Java?

Fokko
Fokko previously approved these changes Dec 16, 2024
Copy link
Member

@Xuanwo Xuanwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @c-thiel for working on this and also thanks @Fokko's active review. Let's move!

@Xuanwo Xuanwo merged commit f00d89b into apache:main Dec 16, 2024
16 checks passed
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants