test: append partition data file #742

feniljain · 2024-11-30T10:42:21Z

Issue Resolved

Closes #720

Description

Have added test case for base partition data file addition
Added a test case where different "type" of partition field is set in DataFileWriterBuilder when compared to schema
Added a test case where number of partition fields set in DataFileWriterBuilder is different from schema
Fixed noticed spelling mistakes in transaction.rs

Output screenshots

Fokko · 2024-12-03T21:14:30Z

Thanks @feniljain for working on this 🙌 Did some checks:

First metadata.json:

{
  "format-version" : 2,
  "table-uuid" : "eb83b77f-c2c3-473c-a138-444a3de61213",
  "location" : "s3://icebergdata/demo/iceberg/rust/append_partition_data_file_test",
  "last-sequence-number" : 0,
  "last-updated-ms" : 1733259665987,
  "last-column-id" : 3,
  "current-schema-id" : 0,
  "schemas" : [ {
    "type" : "struct",
    "schema-id" : 0,
    "identifier-field-ids" : [ 2 ],
    "fields" : [ {
      "id" : 1,
      "name" : "foo",
      "required" : false,
      "type" : "string"
    }, {
      "id" : 2,
      "name" : "bar",
      "required" : true,
      "type" : "int"
    }, {
      "id" : 3,
      "name" : "baz",
      "required" : false,
      "type" : "boolean"
    } ]
  } ],
  "default-spec-id" : 0,
  "partition-specs" : [ {
    "spec-id" : 0,
    "fields" : [ {
      "name" : "id",
      "transform" : "identity",
      "source-id" : 2,
      "field-id" : 1000
    } ]
  } ],
  "last-partition-id" : 1000,
  "default-sort-order-id" : 0,
  "sort-orders" : [ {
    "order-id" : 0,
    "fields" : [ ]
  } ],
  "properties" : {
    "write.parquet.compression-codec" : "zstd"
  },
  "current-snapshot-id" : -1,
  "refs" : { },
  "snapshots" : [ ],
  "statistics" : [ ],
  "partition-statistics" : [ ],
  "snapshot-log" : [ ],
  "metadata-log" : [ ]
}

Without a commit, the current-snapshot-id should be null instead of -1.

After the commit:

{
  "format-version" : 2,
  "table-uuid" : "eb83b77f-c2c3-473c-a138-444a3de61213",
  "location" : "s3://icebergdata/demo/iceberg/rust/append_partition_data_file_test",
  "last-sequence-number" : 1,
  "last-updated-ms" : 1733259666572,
  "last-column-id" : 3,
  "current-schema-id" : 0,
  "schemas" : [ {
    "type" : "struct",
    "schema-id" : 0,
    "identifier-field-ids" : [ 2 ],
    "fields" : [ {
      "id" : 1,
      "name" : "foo",
      "required" : false,
      "type" : "string"
    }, {
      "id" : 2,
      "name" : "bar",
      "required" : true,
      "type" : "int"
    }, {
      "id" : 3,
      "name" : "baz",
      "required" : false,
      "type" : "boolean"
    } ]
  } ],
  "default-spec-id" : 0,
  "partition-specs" : [ {
    "spec-id" : 0,
    "fields" : [ {
      "name" : "id",
      "transform" : "identity",
      "source-id" : 2,
      "field-id" : 1000
    } ]
  } ],
  "last-partition-id" : 1000,
  "default-sort-order-id" : 0,
  "sort-orders" : [ {
    "order-id" : 0,
    "fields" : [ ]
  } ],
  "properties" : {
    "write.parquet.compression-codec" : "zstd"
  },
  "current-snapshot-id" : 8826880672679595429,
  "refs" : {
    "main" : {
      "snapshot-id" : 8826880672679595429,
      "type" : "branch"
    }
  },
  "snapshots" : [ {
    "sequence-number" : 1,
    "snapshot-id" : 8826880672679595429,
    "timestamp-ms" : 1733259666572,
    "summary" : {
      "operation" : "append"
    },
    "manifest-list" : "s3://icebergdata/demo/iceberg/rust/append_partition_data_file_test/metadata/snap-8826880672679595429-0-01938e53-a487-7ee2-a75e-c061dea0853c.avro",
    "schema-id" : 0
  } ],
  "statistics" : [ ],
  "partition-statistics" : [ ],
  "snapshot-log" : [ {
    "timestamp-ms" : 1733259666572,
    "snapshot-id" : 8826880672679595429
  } ],
  "metadata-log" : [ {
    "timestamp-ms" : 1733259665987,
    "metadata-file" : "s3://icebergdata/demo/iceberg/rust/append_partition_data_file_test/metadata/00000-647ca34f-8a7b-4a44-8d28-775bc62ef650.metadata.json"
  } ]
}

Which looks good.

{
    "manifest_path": "s3://icebergdata/demo/iceberg/rust/append_partition_data_file_test/metadata/01938e53-a487-7ee2-a75e-c061dea0853c-m0.avro",
    "manifest_length": 3391,
    "partition_spec_id": 0,
    "content": 0,
    "sequence_number": 1,
    "min_sequence_number": 1,
    "added_snapshot_id": 8826880672679595000,
    "added_files_count": 1,
    "existing_files_count": 0,
    "deleted_files_count": 0,
    "added_rows_count": 2,
    "existing_rows_count": 0,
    "deleted_rows_count": 0,
    "partitions": {
        "array": [
            {
                "contains_null": false,
                "contains_nan": null,
                "lower_bound": {
                    "bytes": "d\u0000\u0000\u0000"
                },
                "upper_bound": {
                    "bytes": "d\u0000\u0000\u0000"
                }
            }
        ]
    },
    "key_metadata": null
}

Which also looks good. The snapshot:

{
    "status": 1,
    "snapshot_id": null,
    "sequence_number": null,
    "file_sequence_number": null,
    "data_file": {
        "content": 0,
        "file_path": "s3://icebergdata/demo/iceberg/rust/append_partition_data_file_test/data/test-00000.parquet",
        "file_format": "PARQUET",
        "partition": {
            "id": {
                "int": 100
            }
        },
        "record_count": 2,
        "file_size_in_bytes": 1160,
        "column_sizes": {
            "array": [
                {
                    "key": 3,
                    "value": 36
                },
                {
                    "key": 1,
                    "value": 74
                },
                {
                    "key": 2,
                    "value": 55
                }
            ]
        },
        "value_counts": {
            "array": [
                {
                    "key": 1,
                    "value": 2
                },
                {
                    "key": 3,
                    "value": 2
                },
                {
                    "key": 2,
                    "value": 2
                }
            ]
        },
        "null_value_counts": {
            "array": [
                {
                    "key": 2,
                    "value": 0
                },
                {
                    "key": 3,
                    "value": 0
                },
                {
                    "key": 1,
                    "value": 0
                }
            ]
        },
        "nan_value_counts": {
            "array": []
        },
        "lower_bounds": {
            "array": [
                {
                    "key": 2,
                    "value": "d\u0000\u0000\u0000"
                },
                {
                    "key": 3,
                    "value": "\u0000"
                },
                {
                    "key": 1,
                    "value": "foo1"
                }
            ]
        },
        "upper_bounds": {
            "array": [
                {
                    "key": 1,
                    "value": "foo2"
                },
                {
                    "key": 2,
                    "value": "d\u0000\u0000\u0000"
                },
                {
                    "key": 3,
                    "value": "\u0001"
                }
            ]
        },
        "key_metadata": {
            "bytes": ""
        },
        "split_offsets": {
            "array": [
                4
            ]
        },
        "equality_ids": {
            "array": []
        },
        "sort_order_id": null
    }
}

Probably we just want to set the key_metadata to null instead of empty bytes.

feniljain · 2024-12-04T06:05:05Z

Hey @Fokko 👋🏻

Thanks a lot for checking up in detail! Can I take up both of the issues as both are related to this test itself? 😅

Also, slightly tangential, but I have a small idea 💡, do you think we can use snapshot based testing over these files for end-to-end tests? Snapshot based testing would allow us to not check a lot of fields in every test, and we can just compare + evolve snapshots as new features are added. I have seen the idea being used with great success in projects like rust-analyzer before, and crates like https://docs.rs/insta/latest/insta/ can help us set it up.

If you think this makes sense and I should create a new issue to discuss this, do let me know, will do that :)

Fokko · 2024-12-06T14:35:26Z

Regarding the testing, I would like to invite @liurenjie1024 @Xuanwo and @ZENOTME to give an opinion on that :D I'm just tipping my toe into lake rust 🦀

crates/integration_tests/tests/append_partition_data_file_test.rs

ZENOTME · 2024-12-06T15:31:55Z

Hey @Fokko 👋🏻

Thanks a lot for checking up in detail! Can I take up both of the issues as both are related to this test itself? 😅

Also, slightly tangential, but I have a small idea 💡, do you think we can use snapshot based testing over these files for end-to-end tests? Snapshot based testing would allow us to not check a lot of fields in every test, and we can just compare + evolve snapshots as new features are added. I have seen the idea being used with great success in projects like rust-analyzer before, and crates like https://docs.rs/insta/latest/insta/ can help us set it up.

If you think this makes sense and I should create a new issue to discuss this, do let me know, will do that :)

Thanks @feniljain! I try insta locally and I think this is a cool idea. This tool enables us to compare the field of snapshots conveniently. One thing we need to address maybe filter out some random field. A creative idea is to support Avro format files, allowing us to create snapshots of the entire Iceberg metadata, which can then be used for quick comparisons in end-to-end tests. I also agree to open an issue to discuss this.

Xuanwo

Thank you @feniljain for working on this and thank you @Fokko & @ZENOTME's review. Let's move!

feniljain force-pushed the iceberg_partition_test branch 2 times, most recently from f5a9d4a to 8e0a702 Compare November 30, 2024 10:52

Fokko previously approved these changes Dec 3, 2024

View reviewed changes

This was referenced Dec 3, 2024

Write null for current-snapshot-id #752

Closed

Write null for key_metadata instead of empty bytes #753

Closed

Fokko marked this pull request as ready for review December 6, 2024 14:35

ZENOTME reviewed Dec 6, 2024

View reviewed changes

crates/integration_tests/tests/append_partition_data_file_test.rs Outdated Show resolved Hide resolved

feniljain mentioned this pull request Dec 14, 2024

fix: set key_metadata to Null by default #800

Merged

feniljain force-pushed the iceberg_partition_test branch from 038b790 to 138c9d9 Compare December 14, 2024 10:18

feniljain added 2 commits December 14, 2024 16:39

test: append partition data file

2213be5

chore: fix compatible spell mistake

f68e80d

feniljain dismissed Fokko’s stale review via f68e80d December 14, 2024 11:12

feniljain force-pushed the iceberg_partition_test branch from 138c9d9 to f68e80d Compare December 14, 2024 11:12

feniljain requested review from ZENOTME and Fokko December 14, 2024 11:20

Xuanwo approved these changes Dec 14, 2024

View reviewed changes

Xuanwo merged commit 7981def into apache:main Dec 14, 2024
16 checks passed

feniljain mentioned this pull request Dec 14, 2024

Snapshot Testing for Integration Tests #803

Open

feniljain deleted the iceberg_partition_test branch December 14, 2024 14:18

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test: append partition data file #742

test: append partition data file #742

feniljain commented Nov 30, 2024 •

edited

Loading

Fokko commented Dec 3, 2024 •

edited

Loading

feniljain commented Dec 4, 2024 •

edited

Loading

Fokko commented Dec 6, 2024

ZENOTME commented Dec 6, 2024 •

edited

Loading

Xuanwo left a comment

test: append partition data file #742

test: append partition data file #742

Conversation

feniljain commented Nov 30, 2024 • edited Loading

Issue Resolved

Description

Output screenshots

Fokko commented Dec 3, 2024 • edited Loading

feniljain commented Dec 4, 2024 • edited Loading

Fokko commented Dec 6, 2024

ZENOTME commented Dec 6, 2024 • edited Loading

Xuanwo left a comment

Choose a reason for hiding this comment

feniljain commented Nov 30, 2024 •

edited

Loading

Fokko commented Dec 3, 2024 •

edited

Loading

feniljain commented Dec 4, 2024 •

edited

Loading

ZENOTME commented Dec 6, 2024 •

edited

Loading