feat: Table metadata #29

Merged
merged 44 commits into from
Aug 21, 2023

Conversation

JanKaul
Collaborator

@JanKaul JanKaul commented Aug 10, 2023

This PR defines all structures necessary to represent Iceberg Table Metadata. The main focus lies on serialization and deserialization from JSON. Some functionality might need to be added later on.

@JanKaul
Collaborator Author

JanKaul commented Aug 10, 2023

I will add some more tests.

Member

@Xuanwo Xuanwo left a comment

Otherwise LGTM, thanks for your hard work.

crates/iceberg/src/spec/mod.rs (outdated)
Collaborator

@liurenjie1024 liurenjie1024 left a comment

Great work, thanks @JanKaul! About the integration tests: are you planning to add them in this PR or in a follow-up PR?

crates/iceberg/src/spec/schema.rs (outdated)
crates/iceberg/src/spec/schema.rs (outdated)
crates/iceberg/src/spec/mod.rs (outdated)
crates/iceberg/src/spec/snapshot.rs (outdated)
crates/iceberg/src/spec/snapshot.rs (outdated)
crates/iceberg/src/spec/snapshot.rs (outdated)
crates/iceberg/src/spec/table_metadata.rs (outdated)
crates/iceberg/src/spec/table_metadata.rs (outdated)
crates/iceberg/src/spec/table_metadata.rs (outdated)
crates/iceberg/src/spec/table_metadata.rs (outdated)
@JanKaul
Collaborator Author

JanKaul commented Aug 11, 2023

By integration tests you mean reading and writing an actual metadata.json file?

@liurenjie1024
Collaborator

> By integration tests you mean reading and writing an actual metadata.json file?

Yes, I mean the files quoted by @Fokko in #28

@JanKaul
Collaborator Author

JanKaul commented Aug 12, 2023

I can include it here. Where should I place the files for testing? Should I create a folder at the workspace level?

@liurenjie1024
Collaborator

liurenjie1024 commented Aug 12, 2023

> I can include it here. Where should I place the files for testing? Should I create a folder at the workspace level?

Other projects usually put a `testdata` folder alongside the `src` folder, for example:

```
crates
    iceberg
        src
        testdata
```

cc @Xuanwo any other suggestions?

Collaborator

@liurenjie1024 liurenjie1024 left a comment

We are almost there!

crates/iceberg/src/spec/snapshot.rs (outdated)
crates/iceberg/src/spec/table_metadata.rs (outdated)
@liurenjie1024
Collaborator

cc @Fokko PTAL

#[derive(Debug, Serialize, Deserialize, PartialEq, Eq, Clone)]
#[serde(rename_all = "kebab-case")]
/// Names and types of fields in a table.
pub(crate) struct SchemaV1 {
Contributor

Do we really need SchemaV1? This is what we do in PyIceberg when reading V1 metadata and encountering a schema in the schema field:

image

We just set the schema-id to 0 and add the schema to the schemas field. We consider the metadata equivalent.
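
The normalization described in this comment could look roughly like the following in Rust. This is a minimal sketch, not the PR's code: `LegacySchema`, `RawMetadataV1`, `Metadata`, and `normalize_v1` are hypothetical names, and a real schema carries full field definitions rather than just names.

```rust
/// Hypothetical legacy v1 schema that carries no schema id.
struct LegacySchema {
    field_names: Vec<String>,
}

/// Hypothetical, simplified schema with an id.
#[derive(Debug, Clone, PartialEq)]
struct Schema {
    schema_id: i32,
    field_names: Vec<String>,
}

/// Hypothetical raw v1 metadata right after parsing: `schema` is the
/// legacy single-schema field, `schemas` is the newer list.
struct RawMetadataV1 {
    schema: Option<LegacySchema>,
    schemas: Option<Vec<Schema>>,
    current_schema_id: Option<i32>,
}

/// Normalized metadata where the schema list is always present.
#[derive(Debug, PartialEq)]
struct Metadata {
    schemas: Vec<Schema>,
    current_schema_id: i32,
}

fn normalize_v1(raw: RawMetadataV1) -> Result<Metadata, String> {
    // Prefer the list form when it is present.
    if let Some(schemas) = raw.schemas {
        let current_schema_id = raw
            .current_schema_id
            .ok_or_else(|| "schemas present but current-schema-id missing".to_string())?;
        return Ok(Metadata { schemas, current_schema_id });
    }
    // Otherwise fall back to the legacy field: assign id 0 and wrap it in a list.
    let legacy = raw
        .schema
        .ok_or_else(|| "neither schema nor schemas present".to_string())?;
    Ok(Metadata {
        schemas: vec![Schema { schema_id: 0, field_names: legacy.field_names }],
        current_schema_id: 0,
    })
}
```

With this shape, callers only ever see the list form, regardless of which field the file on disk used.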

Collaborator Author

@JanKaul JanKaul Aug 17, 2023

SchemaV1 (and SchemaV2) are internal structs and are not visible to a user of the library. SchemaV1 is just used for serialization/deserialization. We can still do your recommended conversion.

The only publicly visible struct for a schema is the Schema struct, which has the same representation for v1 and v2 tables.

type Error = Error;
fn try_from(value: SchemaV1) -> Result<Self> {
Schema::builder()
.with_schema_id(value.schema_id.unwrap_or(DEFAULT_SCHEMA_ID))
Contributor

It looks like it isn't set to null?

Collaborator Author

This is called when deserializing a v1 schema into the general Schema struct. If the v1 schema doesn't have a schema id, we assign a default schema_id on read.
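
To make the default-on-read behaviour concrete, here is a small std-only sketch of such a conversion. The struct fields are trimmed down for illustration, the value `0` for `DEFAULT_SCHEMA_ID` is an assumption, and the empty-schema check is an arbitrary example of why the conversion is fallible; none of this is the PR's exact code.

```rust
use std::convert::TryFrom;

// Assumed value; the constant's actual definition is not shown in this thread.
const DEFAULT_SCHEMA_ID: i32 = 0;

/// Trimmed-down stand-in for the internal v1 serialization struct.
struct SchemaV1 {
    schema_id: Option<i32>,
    field_names: Vec<String>,
}

/// Trimmed-down stand-in for the public, version-independent schema.
#[derive(Debug, PartialEq)]
struct Schema {
    schema_id: i32,
    field_names: Vec<String>,
}

impl TryFrom<SchemaV1> for Schema {
    type Error = String;

    fn try_from(value: SchemaV1) -> Result<Self, Self::Error> {
        // Illustrative validation only; the real checks may differ.
        if value.field_names.is_empty() {
            return Err("schema has no fields".to_string());
        }
        Ok(Schema {
            // A v1 schema without an id gets the default id on read.
            schema_id: value.schema_id.unwrap_or(DEFAULT_SCHEMA_ID),
            field_names: value.field_names,
        })
    }
}
```

A v1 schema that already carries an id keeps it; only a missing id is replaced by the default.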

Collaborator

@liurenjie1024 liurenjie1024 left a comment

LGTM, thanks @JanKaul

#[derive(Debug, Serialize, Deserialize, PartialEq, Eq)]
#[serde(rename_all = "kebab-case")]
/// A snapshot represents the state of a table at some time and is used to access the complete set of data files in the table.
pub(crate) struct SnapshotV2 {
Contributor

Again, I would combine the V1 and V2. The sequence_number is added later on, and there is some logic to set it afterward:

def _inherit_sequence_number(entry: ManifestEntry, manifest: ManifestFile) -> ManifestEntry:
    """Inherits the sequence numbers.

    More information in the spec: https://iceberg.apache.org/spec/#sequence-number-inheritance

    Args:
        entry: The manifest entry that has null sequence numbers.
        manifest: The manifest that has a sequence number.

    Returns:
        The manifest entry with the sequence numbers set.
    """
    # The snapshot_id is required in V1, inherit with V2 when null
    if entry.snapshot_id is None:
        entry.snapshot_id = manifest.added_snapshot_id

    # in v1 tables, the data sequence number is not persisted and can be safely defaulted to 0
    # in v2 tables, the data sequence number should be inherited iff the entry status is ADDED
    if entry.data_sequence_number is None and (manifest.sequence_number == 0 or entry.status == ManifestEntryStatus.ADDED):
        entry.data_sequence_number = manifest.sequence_number

    # in v1 tables, the file sequence number is not persisted and can be safely defaulted to 0
    # in v2 tables, the file sequence number should be inherited iff the entry status is ADDED
    if entry.file_sequence_number is None and (manifest.sequence_number == 0 or entry.status == ManifestEntryStatus.ADDED):
        # Only available in V2, always 0 in V1
        entry.file_sequence_number = manifest.sequence_number

    return entry

This can happen when deserializing the JSON, or later on (like we do in PyIceberg).
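
For readers following along in Rust, the Python logic above might translate to something like this sketch. The types are hypothetical and simplified (they are not part of this PR), with `Option` standing in for the nullable fields.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Status {
    Existing,
    Added,
    Deleted,
}

/// Hypothetical manifest entry with nullable inherited fields.
#[derive(Debug, Clone, PartialEq)]
struct ManifestEntry {
    status: Status,
    snapshot_id: Option<i64>,
    data_sequence_number: Option<i64>,
    file_sequence_number: Option<i64>,
}

/// Hypothetical manifest file metadata.
struct ManifestFile {
    added_snapshot_id: i64,
    sequence_number: i64,
}

fn inherit_sequence_number(mut entry: ManifestEntry, manifest: &ManifestFile) -> ManifestEntry {
    // The snapshot id is required in v1; in v2 it is inherited when null.
    if entry.snapshot_id.is_none() {
        entry.snapshot_id = Some(manifest.added_snapshot_id);
    }
    // v1 manifests have sequence number 0, so defaulting is always safe there;
    // in v2 only entries with status ADDED inherit.
    let inherit = manifest.sequence_number == 0 || entry.status == Status::Added;
    if entry.data_sequence_number.is_none() && inherit {
        entry.data_sequence_number = Some(manifest.sequence_number);
    }
    if entry.file_sequence_number.is_none() && inherit {
        entry.file_sequence_number = Some(manifest.sequence_number);
    }
    entry
}
```

Taking the entry by value and returning it mirrors the Python function's shape; mutating through `&mut ManifestEntry` would work equally well.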

Collaborator

I don't quite understand. Doesn't the spec say it's required in v2?
image

I think the inheritance applies to manifests, but a snapshot should have a sequence number in its JSON, shouldn't it?

parent_snapshot_id: v2.parent_snapshot_id,
sequence_number: v2.sequence_number,
timestamp_ms: v2.timestamp_ms,
manifest_list: match v2.manifest_list {
Contributor

Hmm, in PyIceberg we don't check for the manifests field. cc @rdblue

Collaborator

The spec says it's required in v2?
image
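
One way to picture the difference under discussion: in the Iceberg spec, manifest-list is required for v2 snapshots, while a v1 snapshot may instead carry an inline manifests list. The sketch below models that with hypothetical names (`ManifestListLocation`, `SnapshotV1Parts`, and both functions are illustrative); how to treat a v1 snapshot with neither field is an assumption made here, not something settled in this thread.

```rust
/// Hypothetical unified representation of where a snapshot's manifests live.
#[derive(Debug, PartialEq)]
enum ManifestListLocation {
    /// Location of a manifest list file.
    ManifestListFile(String),
    /// Inline list of manifest file locations (v1 only).
    ManifestFiles(Vec<String>),
}

/// Hypothetical subset of a v1 snapshot: both fields are optional on disk.
struct SnapshotV1Parts {
    manifest_list: Option<String>,
    manifests: Option<Vec<String>>,
}

fn manifest_location_v1(parts: SnapshotV1Parts) -> Result<ManifestListLocation, String> {
    match (parts.manifest_list, parts.manifests) {
        (Some(file), None) => Ok(ManifestListLocation::ManifestListFile(file)),
        (None, Some(files)) => Ok(ManifestListLocation::ManifestFiles(files)),
        (Some(_), Some(_)) => {
            Err("manifests must be omitted when manifest-list is present".to_string())
        }
        // Assumption: reject snapshots that reference no manifests at all.
        (None, None) => Err("v1 snapshot has neither manifest-list nor manifests".to_string()),
    }
}

fn manifest_location_v2(manifest_list: Option<String>) -> Result<ManifestListLocation, String> {
    // In v2 the manifest list is required, so a missing value is an error.
    manifest_list
        .map(ManifestListLocation::ManifestListFile)
        .ok_or_else(|| "manifest-list is required in v2".to_string())
}
```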

}

#[test]
fn test_table_data_v1() {
Contributor

I would recommend adding a very minimal v1 metadata example where schema is present but schemas is missing. The same goes for partition-spec being present with partition-specs missing, and likewise for sort-order etc.

@Fokko
Contributor

Fokko commented Aug 17, 2023

Left some comments, great work @JanKaul 🚀

@liurenjie1024
Collaborator

Hi @JanKaul, I would suggest adding integration tests with JSON data in a follow-up PR. This PR is a little too large for me.

@liurenjie1024
Collaborator

It seems that the V1/V2 suffix does not make it clear enough that these structs are only used to make ser/de easier to write, which caused some misunderstanding among reviewers. I would suggest two improvements:

  1. Move structs for serde into private modules such as _serde
  2. Add comments to explain that these are only for format conversion, not user facing api.
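
A rough sketch of what these two suggestions could look like together, using only std so it stays self-contained. The field sets are heavily trimmed, the helper functions `from_v1_parts` and `from_v2_parts` are hypothetical, and defaulting the v1 sequence number to 0 is an assumption; in the real code the V1/V2 structs would additionally derive the serde traits.

```rust
mod snapshot {
    /// Public, version-independent snapshot type. This is the user-facing API.
    #[derive(Debug, PartialEq)]
    pub struct Snapshot {
        pub snapshot_id: i64,
        pub sequence_number: i64,
    }

    /// Private module holding the on-disk shapes. These structs exist only
    /// for format conversion and are NOT part of the user-facing API.
    mod _serde {
        use super::Snapshot;

        /// v1 on-disk shape: no sequence number is stored.
        pub(super) struct SnapshotV1 {
            pub(super) snapshot_id: i64,
        }

        /// v2 on-disk shape: the sequence number is present.
        pub(super) struct SnapshotV2 {
            pub(super) snapshot_id: i64,
            pub(super) sequence_number: i64,
        }

        impl From<SnapshotV1> for Snapshot {
            fn from(v1: SnapshotV1) -> Self {
                // Assumption for this sketch: v1 snapshots default to sequence number 0.
                Snapshot { snapshot_id: v1.snapshot_id, sequence_number: 0 }
            }
        }

        impl From<SnapshotV2> for Snapshot {
            fn from(v2: SnapshotV2) -> Self {
                Snapshot { snapshot_id: v2.snapshot_id, sequence_number: v2.sequence_number }
            }
        }
    }

    /// Hypothetical helper standing in for "deserialize a v1 snapshot".
    pub fn from_v1_parts(snapshot_id: i64) -> Snapshot {
        _serde::SnapshotV1 { snapshot_id }.into()
    }

    /// Hypothetical helper standing in for "deserialize a v2 snapshot".
    pub fn from_v2_parts(snapshot_id: i64, sequence_number: i64) -> Snapshot {
        _serde::SnapshotV2 { snapshot_id, sequence_number }.into()
    }
}
```

Code outside the `snapshot` module can name `Snapshot` but not `SnapshotV1` or `SnapshotV2`, so the versioned structs cannot leak into the public API by accident.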

cc @JanKaul @Fokko

@JanKaul
Collaborator Author

JanKaul commented Aug 18, 2023

Good idea, thank you for your great comments!

@liurenjie1024
Collaborator

To add some background about the design philosophy here for reviewers not familiar with Rust:

  1. All structs with the suffix V1/V2 exist to make serialization/deserialization easier to maintain. They are something like a handwritten schema definition of the spec. They are discarded after reading from or writing to a file on disk, and they are not user facing. Unlike Java/Python, Rust has no runtime reflection, so the serialization/deserialization code is generated at compile time.

  2. About access modifiers: pub in Rust is similar to public in Java, which means accessible to code outside of the package. pub(crate) is similar to Java's default access modifier, which is only visible to code in the same package.

We have a discussion about the overall structure de# #2 #3

cc @Fokko Hope this comment can help you understand it better.

Contributor

@Fokko Fokko left a comment

Thanks for working on this, @JanKaul, and thanks @liurenjie1024 for teaching me about Rust, appreciate it!

This looks good, thanks!

.with_partition_field(PartitionField {
name: "ts_day".to_string(),
transform: Transform::Day,
source_id: 4,
Contributor

Idk what the best place for Rust is to do validation, but in this case, source id 4 does not exist in the current schema.

Collaborator Author

Good point, it is probably best to do it during deserialization. We should add it in another PR.
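
As a starting point for that follow-up, the check might look like the sketch below. `validate_partition_sources` and the trimmed `PartitionField` are hypothetical, and the schema is reduced to a list of field ids; where this should be called from (for example, at the end of deserialization) is left open.

```rust
use std::collections::HashSet;

/// Trimmed-down partition field: only what the check needs.
struct PartitionField {
    name: String,
    source_id: i32,
}

/// Returns an error if any partition field references a source id that is
/// not among the schema's field ids.
fn validate_partition_sources(
    schema_field_ids: &[i32],
    fields: &[PartitionField],
) -> Result<(), String> {
    let known: HashSet<i32> = schema_field_ids.iter().copied().collect();
    for field in fields {
        if !known.contains(&field.source_id) {
            return Err(format!(
                "partition field '{}' references unknown source id {}",
                field.name, field.source_id
            ));
        }
    }
    Ok(())
}
```

With a schema whose field ids are 1, 2, and 3, a `ts_day` field with source id 4 would be rejected.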

@Fokko Fokko merged commit bdc66a0 into apache:main Aug 21, 2023
7 checks passed
@JanKaul JanKaul deleted the table-metadata branch March 4, 2024 15:06
4 participants