Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

test: append partition data file #742

Merged
merged 2 commits into from
Dec 14, 2024

Conversation

feniljain
Copy link
Contributor

@feniljain feniljain commented Nov 30, 2024

Issue Resolved

Closes #720

Description

  • Have added test case for base partition data file addition
  • Added a test case where different "type" of partition field is set in DataFileWriterBuilder when compared to schema
  • Added a test case where number of partition fields set in DataFileWriterBuilder is different from schema
  • Fixed noticed spelling mistakes in transaction.rs

Output screenshots

Screenshot 2024-11-30 at 16 07 39

@feniljain feniljain force-pushed the iceberg_partition_test branch 2 times, most recently from f5a9d4a to 8e0a702 Compare November 30, 2024 10:52
@Fokko
Copy link
Contributor

Fokko commented Dec 3, 2024

Thanks @feniljain for working on this 🙌 Did some checks:

First metadata.json:

{
  "format-version" : 2,
  "table-uuid" : "eb83b77f-c2c3-473c-a138-444a3de61213",
  "location" : "s3://icebergdata/demo/iceberg/rust/append_partition_data_file_test",
  "last-sequence-number" : 0,
  "last-updated-ms" : 1733259665987,
  "last-column-id" : 3,
  "current-schema-id" : 0,
  "schemas" : [ {
    "type" : "struct",
    "schema-id" : 0,
    "identifier-field-ids" : [ 2 ],
    "fields" : [ {
      "id" : 1,
      "name" : "foo",
      "required" : false,
      "type" : "string"
    }, {
      "id" : 2,
      "name" : "bar",
      "required" : true,
      "type" : "int"
    }, {
      "id" : 3,
      "name" : "baz",
      "required" : false,
      "type" : "boolean"
    } ]
  } ],
  "default-spec-id" : 0,
  "partition-specs" : [ {
    "spec-id" : 0,
    "fields" : [ {
      "name" : "id",
      "transform" : "identity",
      "source-id" : 2,
      "field-id" : 1000
    } ]
  } ],
  "last-partition-id" : 1000,
  "default-sort-order-id" : 0,
  "sort-orders" : [ {
    "order-id" : 0,
    "fields" : [ ]
  } ],
  "properties" : {
    "write.parquet.compression-codec" : "zstd"
  },
  "current-snapshot-id" : -1,
  "refs" : { },
  "snapshots" : [ ],
  "statistics" : [ ],
  "partition-statistics" : [ ],
  "snapshot-log" : [ ],
  "metadata-log" : [ ]
}

Without a commit, the current-snapshot-id should be null instead of -1.

After the commit:

{
  "format-version" : 2,
  "table-uuid" : "eb83b77f-c2c3-473c-a138-444a3de61213",
  "location" : "s3://icebergdata/demo/iceberg/rust/append_partition_data_file_test",
  "last-sequence-number" : 1,
  "last-updated-ms" : 1733259666572,
  "last-column-id" : 3,
  "current-schema-id" : 0,
  "schemas" : [ {
    "type" : "struct",
    "schema-id" : 0,
    "identifier-field-ids" : [ 2 ],
    "fields" : [ {
      "id" : 1,
      "name" : "foo",
      "required" : false,
      "type" : "string"
    }, {
      "id" : 2,
      "name" : "bar",
      "required" : true,
      "type" : "int"
    }, {
      "id" : 3,
      "name" : "baz",
      "required" : false,
      "type" : "boolean"
    } ]
  } ],
  "default-spec-id" : 0,
  "partition-specs" : [ {
    "spec-id" : 0,
    "fields" : [ {
      "name" : "id",
      "transform" : "identity",
      "source-id" : 2,
      "field-id" : 1000
    } ]
  } ],
  "last-partition-id" : 1000,
  "default-sort-order-id" : 0,
  "sort-orders" : [ {
    "order-id" : 0,
    "fields" : [ ]
  } ],
  "properties" : {
    "write.parquet.compression-codec" : "zstd"
  },
  "current-snapshot-id" : 8826880672679595429,
  "refs" : {
    "main" : {
      "snapshot-id" : 8826880672679595429,
      "type" : "branch"
    }
  },
  "snapshots" : [ {
    "sequence-number" : 1,
    "snapshot-id" : 8826880672679595429,
    "timestamp-ms" : 1733259666572,
    "summary" : {
      "operation" : "append"
    },
    "manifest-list" : "s3://icebergdata/demo/iceberg/rust/append_partition_data_file_test/metadata/snap-8826880672679595429-0-01938e53-a487-7ee2-a75e-c061dea0853c.avro",
    "schema-id" : 0
  } ],
  "statistics" : [ ],
  "partition-statistics" : [ ],
  "snapshot-log" : [ {
    "timestamp-ms" : 1733259666572,
    "snapshot-id" : 8826880672679595429
  } ],
  "metadata-log" : [ {
    "timestamp-ms" : 1733259665987,
    "metadata-file" : "s3://icebergdata/demo/iceberg/rust/append_partition_data_file_test/metadata/00000-647ca34f-8a7b-4a44-8d28-775bc62ef650.metadata.json"
  } ]
}

Which looks good.

{
    "manifest_path": "s3://icebergdata/demo/iceberg/rust/append_partition_data_file_test/metadata/01938e53-a487-7ee2-a75e-c061dea0853c-m0.avro",
    "manifest_length": 3391,
    "partition_spec_id": 0,
    "content": 0,
    "sequence_number": 1,
    "min_sequence_number": 1,
    "added_snapshot_id": 8826880672679595000,
    "added_files_count": 1,
    "existing_files_count": 0,
    "deleted_files_count": 0,
    "added_rows_count": 2,
    "existing_rows_count": 0,
    "deleted_rows_count": 0,
    "partitions": {
        "array": [
            {
                "contains_null": false,
                "contains_nan": null,
                "lower_bound": {
                    "bytes": "d\u0000\u0000\u0000"
                },
                "upper_bound": {
                    "bytes": "d\u0000\u0000\u0000"
                }
            }
        ]
    },
    "key_metadata": null
}

Which also looks good. The snapshot:

{
    "status": 1,
    "snapshot_id": null,
    "sequence_number": null,
    "file_sequence_number": null,
    "data_file": {
        "content": 0,
        "file_path": "s3://icebergdata/demo/iceberg/rust/append_partition_data_file_test/data/test-00000.parquet",
        "file_format": "PARQUET",
        "partition": {
            "id": {
                "int": 100
            }
        },
        "record_count": 2,
        "file_size_in_bytes": 1160,
        "column_sizes": {
            "array": [
                {
                    "key": 3,
                    "value": 36
                },
                {
                    "key": 1,
                    "value": 74
                },
                {
                    "key": 2,
                    "value": 55
                }
            ]
        },
        "value_counts": {
            "array": [
                {
                    "key": 1,
                    "value": 2
                },
                {
                    "key": 3,
                    "value": 2
                },
                {
                    "key": 2,
                    "value": 2
                }
            ]
        },
        "null_value_counts": {
            "array": [
                {
                    "key": 2,
                    "value": 0
                },
                {
                    "key": 3,
                    "value": 0
                },
                {
                    "key": 1,
                    "value": 0
                }
            ]
        },
        "nan_value_counts": {
            "array": []
        },
        "lower_bounds": {
            "array": [
                {
                    "key": 2,
                    "value": "d\u0000\u0000\u0000"
                },
                {
                    "key": 3,
                    "value": "\u0000"
                },
                {
                    "key": 1,
                    "value": "foo1"
                }
            ]
        },
        "upper_bounds": {
            "array": [
                {
                    "key": 1,
                    "value": "foo2"
                },
                {
                    "key": 2,
                    "value": "d\u0000\u0000\u0000"
                },
                {
                    "key": 3,
                    "value": "\u0001"
                }
            ]
        },
        "key_metadata": {
            "bytes": ""
        },
        "split_offsets": {
            "array": [
                4
            ]
        },
        "equality_ids": {
            "array": []
        },
        "sort_order_id": null
    }
}

Probably we just want to set the key_metadata to null instead of empty bytes.

Fokko
Fokko previously approved these changes Dec 3, 2024
@feniljain
Copy link
Contributor Author

feniljain commented Dec 4, 2024

Hey @Fokko 👋🏻

Thanks a lot for checking up in detail! Can I take up both of the issues as both are related to this test itself? 😅

Also, slightly tangential, but I have a small idea 💡, do you think we can use snapshot based testing over these files for end-to-end tests? Snapshot based testing would allow us to not check a lot of fields in every test, and we can just compare + evolve snapshots as new features are added. I have seen the idea being used with great success in projects like rust-analyzer before, and crates like https://docs.rs/insta/latest/insta/ can help us set it up.

If you think this makes sense and I should create a new issue to discuss this, do let me know, will do that :)

@Fokko
Copy link
Contributor

Fokko commented Dec 6, 2024

Regarding the testing, I would like to invite @liurenjie1024 @Xuanwo and @ZENOTME to give an opinion on that :D I'm just tipping my toe into lake rust 🦀

@Fokko Fokko marked this pull request as ready for review December 6, 2024 14:35
@ZENOTME
Copy link
Contributor

ZENOTME commented Dec 6, 2024

Hey @Fokko 👋🏻

Thanks a lot for checking up in detail! Can I take up both of the issues as both are related to this test itself? 😅

Also, slightly tangential, but I have a small idea 💡, do you think we can use snapshot based testing over these files for end-to-end tests? Snapshot based testing would allow us to not check a lot of fields in every test, and we can just compare + evolve snapshots as new features are added. I have seen the idea being used with great success in projects like rust-analyzer before, and crates like https://docs.rs/insta/latest/insta/ can help us set it up.

If you think this makes sense and I should create a new issue to discuss this, do let me know, will do that :)

Thanks @feniljain! I try insta locally and I think this is a cool idea. This tool enables us to compare the field of snapshots conveniently. One thing we need to address maybe filter out some random field. A creative idea is to support Avro format files, allowing us to create snapshots of the entire Iceberg metadata, which can then be used for quick comparisons in end-to-end tests. I also agree to open an issue to discuss this.

Copy link
Member

@Xuanwo Xuanwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @feniljain for working on this and thank you @Fokko & @ZENOTME's review. Let's move!

@Xuanwo Xuanwo merged commit 7981def into apache:main Dec 14, 2024
16 checks passed
@feniljain feniljain deleted the iceberg_partition_test branch December 14, 2024 14:18
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Extend the e2e tests with append a partitioned file
4 participants