-
Notifications
You must be signed in to change notification settings - Fork 209
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
feat: TableMetadata Statistic Files #799
Conversation
#[serde(with = "_serde_set_statistics")] | ||
SetStatistics { | ||
/// File containing the statistics | ||
statistics: StatisticsFile, | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In IRC and Java SetStatistics
has an additional field snapshot_id
(link).
This field is redundant with StatisticsFile.snapshot_id
and is only used as an assertion in Java.
I removed the redundancy for rust and will start a discussion on the mailing List how to handle this.
As we still need to be spec compliant, we need custom serializer / deserializer.
Slack: https://apache-iceberg.slack.com/archives/C03LG1D563F/p1734109745807119
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for raising this on the dev-list: https://lists.apache.org/thread/6fnrp73wokfqlt5vormpjyjmtvl29ll1
/// names in the table, and the map values are snapshot reference objects. | ||
/// There is always a main branch reference pointing to the current-snapshot-id | ||
/// even if the refs map is null. | ||
pub(crate) refs: HashMap<String, SnapshotReference>, | ||
/// Mapping of snapshot ids to statistics files. | ||
pub(crate) statistics: HashMap<i64, StatisticsFile>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we ever have more than one StatisticsFile
for a snapshot_id
?
In java this is modeled as a mapping snapshot_id
: Vec<StatisticsFile>
, however I couldn't find a way to get more than one element into the Vec
for a snapshot_id
other than deserializing.
https://github.com/apache/iceberg/blob/540d6a6251e31b232fe6ed2413680621454d107a/core/src/main/java/org/apache/iceberg/TableMetadata.java#L1310-L1320
fc26394
to
f1c166c
Compare
pub fn set_statistics(mut self, statistics: StatisticsFile) -> Self { | ||
self.metadata | ||
.statistics | ||
.insert(statistics.snapshot_id, statistics.clone()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to check if a snapshot-id
exists?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about this as well, but then followed java.
Currently statistics and snapshots are quite separate from each other. If we implement your check (which I like), I think we should eventually also implement:
- Upon deserialization discard statistics that belong to nonexistant snapshots
- When a snapshot is removed delete the statistics for it as well
This would result in snapshots for statistics not missing. It is unclear however what should happen to the puffin files in these cases. We would have coherent metadata, but probably also orphan files.
Do we know why the check is not there in Java?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adds
StatisticFile
andPartitionStatisticsFile
to spec, builder and REST TableUpdate.By far most of the code are tests. I hope that the size of the PR is OK.