-
Notifications
You must be signed in to change notification settings - Fork 209
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
test: append partition data file #742
Conversation
f5a9d4a
to
8e0a702
Compare
Thanks @feniljain for working on this 🙌 Did some checks: First {
"format-version" : 2,
"table-uuid" : "eb83b77f-c2c3-473c-a138-444a3de61213",
"location" : "s3://icebergdata/demo/iceberg/rust/append_partition_data_file_test",
"last-sequence-number" : 0,
"last-updated-ms" : 1733259665987,
"last-column-id" : 3,
"current-schema-id" : 0,
"schemas" : [ {
"type" : "struct",
"schema-id" : 0,
"identifier-field-ids" : [ 2 ],
"fields" : [ {
"id" : 1,
"name" : "foo",
"required" : false,
"type" : "string"
}, {
"id" : 2,
"name" : "bar",
"required" : true,
"type" : "int"
}, {
"id" : 3,
"name" : "baz",
"required" : false,
"type" : "boolean"
} ]
} ],
"default-spec-id" : 0,
"partition-specs" : [ {
"spec-id" : 0,
"fields" : [ {
"name" : "id",
"transform" : "identity",
"source-id" : 2,
"field-id" : 1000
} ]
} ],
"last-partition-id" : 1000,
"default-sort-order-id" : 0,
"sort-orders" : [ {
"order-id" : 0,
"fields" : [ ]
} ],
"properties" : {
"write.parquet.compression-codec" : "zstd"
},
"current-snapshot-id" : -1,
"refs" : { },
"snapshots" : [ ],
"statistics" : [ ],
"partition-statistics" : [ ],
"snapshot-log" : [ ],
"metadata-log" : [ ]
} Without a commit, the After the commit: {
"format-version" : 2,
"table-uuid" : "eb83b77f-c2c3-473c-a138-444a3de61213",
"location" : "s3://icebergdata/demo/iceberg/rust/append_partition_data_file_test",
"last-sequence-number" : 1,
"last-updated-ms" : 1733259666572,
"last-column-id" : 3,
"current-schema-id" : 0,
"schemas" : [ {
"type" : "struct",
"schema-id" : 0,
"identifier-field-ids" : [ 2 ],
"fields" : [ {
"id" : 1,
"name" : "foo",
"required" : false,
"type" : "string"
}, {
"id" : 2,
"name" : "bar",
"required" : true,
"type" : "int"
}, {
"id" : 3,
"name" : "baz",
"required" : false,
"type" : "boolean"
} ]
} ],
"default-spec-id" : 0,
"partition-specs" : [ {
"spec-id" : 0,
"fields" : [ {
"name" : "id",
"transform" : "identity",
"source-id" : 2,
"field-id" : 1000
} ]
} ],
"last-partition-id" : 1000,
"default-sort-order-id" : 0,
"sort-orders" : [ {
"order-id" : 0,
"fields" : [ ]
} ],
"properties" : {
"write.parquet.compression-codec" : "zstd"
},
"current-snapshot-id" : 8826880672679595429,
"refs" : {
"main" : {
"snapshot-id" : 8826880672679595429,
"type" : "branch"
}
},
"snapshots" : [ {
"sequence-number" : 1,
"snapshot-id" : 8826880672679595429,
"timestamp-ms" : 1733259666572,
"summary" : {
"operation" : "append"
},
"manifest-list" : "s3://icebergdata/demo/iceberg/rust/append_partition_data_file_test/metadata/snap-8826880672679595429-0-01938e53-a487-7ee2-a75e-c061dea0853c.avro",
"schema-id" : 0
} ],
"statistics" : [ ],
"partition-statistics" : [ ],
"snapshot-log" : [ {
"timestamp-ms" : 1733259666572,
"snapshot-id" : 8826880672679595429
} ],
"metadata-log" : [ {
"timestamp-ms" : 1733259665987,
"metadata-file" : "s3://icebergdata/demo/iceberg/rust/append_partition_data_file_test/metadata/00000-647ca34f-8a7b-4a44-8d28-775bc62ef650.metadata.json"
} ]
} Which looks good. {
"manifest_path": "s3://icebergdata/demo/iceberg/rust/append_partition_data_file_test/metadata/01938e53-a487-7ee2-a75e-c061dea0853c-m0.avro",
"manifest_length": 3391,
"partition_spec_id": 0,
"content": 0,
"sequence_number": 1,
"min_sequence_number": 1,
"added_snapshot_id": 8826880672679595000,
"added_files_count": 1,
"existing_files_count": 0,
"deleted_files_count": 0,
"added_rows_count": 2,
"existing_rows_count": 0,
"deleted_rows_count": 0,
"partitions": {
"array": [
{
"contains_null": false,
"contains_nan": null,
"lower_bound": {
"bytes": "d\u0000\u0000\u0000"
},
"upper_bound": {
"bytes": "d\u0000\u0000\u0000"
}
}
]
},
"key_metadata": null
} Which also looks good. The snapshot: {
"status": 1,
"snapshot_id": null,
"sequence_number": null,
"file_sequence_number": null,
"data_file": {
"content": 0,
"file_path": "s3://icebergdata/demo/iceberg/rust/append_partition_data_file_test/data/test-00000.parquet",
"file_format": "PARQUET",
"partition": {
"id": {
"int": 100
}
},
"record_count": 2,
"file_size_in_bytes": 1160,
"column_sizes": {
"array": [
{
"key": 3,
"value": 36
},
{
"key": 1,
"value": 74
},
{
"key": 2,
"value": 55
}
]
},
"value_counts": {
"array": [
{
"key": 1,
"value": 2
},
{
"key": 3,
"value": 2
},
{
"key": 2,
"value": 2
}
]
},
"null_value_counts": {
"array": [
{
"key": 2,
"value": 0
},
{
"key": 3,
"value": 0
},
{
"key": 1,
"value": 0
}
]
},
"nan_value_counts": {
"array": []
},
"lower_bounds": {
"array": [
{
"key": 2,
"value": "d\u0000\u0000\u0000"
},
{
"key": 3,
"value": "\u0000"
},
{
"key": 1,
"value": "foo1"
}
]
},
"upper_bounds": {
"array": [
{
"key": 1,
"value": "foo2"
},
{
"key": 2,
"value": "d\u0000\u0000\u0000"
},
{
"key": 3,
"value": "\u0001"
}
]
},
"key_metadata": {
"bytes": ""
},
"split_offsets": {
"array": [
4
]
},
"equality_ids": {
"array": []
},
"sort_order_id": null
}
} Probably we just want to set the |
Hey @Fokko 👋🏻 Thanks a lot for checking up in detail! Can I take up both of the issues as both are related to this test itself? 😅 Also, slightly tangential, but I have a small idea 💡, do you think we can use snapshot based testing over these files for end-to-end tests? Snapshot based testing would allow us to not check a lot of fields in every test, and we can just compare + evolve snapshots as new features are added. I have seen the idea being used with great success in projects like rust-analyzer before, and crates like https://docs.rs/insta/latest/insta/ can help us set it up. If you think this makes sense and I should create a new issue to discuss this, do let me know, will do that :) |
Regarding the testing, I would like to invite @liurenjie1024 @Xuanwo and @ZENOTME to give an opinion on that :D I'm just tipping my toe into lake rust 🦀 |
crates/integration_tests/tests/append_partition_data_file_test.rs
Outdated
Show resolved
Hide resolved
Thanks @feniljain! I try insta locally and I think this is a cool idea. This tool enables us to compare the field of snapshots conveniently. One thing we need to address maybe filter out some random field. A creative idea is to support Avro format files, allowing us to create snapshots of the entire Iceberg metadata, which can then be used for quick comparisons in end-to-end tests. I also agree to open an issue to discuss this. |
038b790
to
138c9d9
Compare
138c9d9
to
f68e80d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @feniljain for working on this and thank you @Fokko & @ZENOTME's review. Let's move!
Issue Resolved
Closes #720
Description
DataFileWriterBuilder
when compared to schemaDataFileWriterBuilder
is different from schematransaction.rs
Output screenshots