Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

feat: Support writing to not only 3/7 protocol #693

Merged
merged 5 commits into from
Feb 20, 2025

Conversation

nicklan
Copy link
Collaborator

@nicklan nicklan commented Feb 12, 2025

What changes are proposed in this pull request?

Our previous check was too strict. Now we just ensure that the protocol makes sense given what features are present/specified.

How was this change tested?

Made all existing write.rs tests also write to a protocol 1/1 table, and they all work.

@nicklan nicklan requested review from scovich and zachschuermann and removed request for scovich February 12, 2025 19:21
@@ -139,136 +149,174 @@ fn new_commit_info() -> DeltaResult<Box<ArrowEngineData>> {
Ok(Box::new(ArrowEngineData::new(commit_info_batch)))
}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like there are a lot of changes in this file, but mainly it's just:

  1. Allowingsetup to create a 3/7 or 1/1 table
  2. adding a new function to generate both types of tables and associated store/engine/location
  3. making each test iterate over those tables instead of just one

"operationParameters": {},
"engineCommitInfo": {
"engineInfo": "default engine"

Copy link
Collaborator Author

@nicklan nicklan Feb 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for basically every test I just:

  1. removed the line like: let table = create_table(store.clone(), table_location, schema, &[]).await?; and replaced it with for (table, engine, store, table_name) in setup_tables(schema, &[]).await? {
  2. updated any hard coded paths to rather use table_name

It looks like more changes, but if you hide whitespace you'll see it's minimal.

Copy link

codecov bot commented Feb 12, 2025

Codecov Report

Attention: Patch coverage is 45.45455% with 6 lines in your changes missing coverage. Please review.

Project coverage is 84.54%. Comparing base (484eb33) to head (5b3cfbc).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
kernel/src/actions/mod.rs 45.45% 6 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #693      +/-   ##
==========================================
- Coverage   84.56%   84.54%   -0.02%     
==========================================
  Files          75       75              
  Lines       17550    17553       +3     
  Branches    17550    17553       +3     
==========================================
  Hits        14841    14841              
- Misses       2000     2003       +3     
  Partials      709      709              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

"Tables with min writer version != 7 should not have table features.",
))
}
None => {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we still need the default case (but with a different error message).

Every writer feature > 1 introduces new capabilities kernel must support (cumulative), with no way to know whether the table actually uses those features. See the full list here.

Versions 2, 3, and 4 are especially problematic because they respectively require column invariants, check constraints, and generated cols -- none of which we could implement because they are underspecified and their implementation is highly spark-specific.

Writer version 6 is also problematic due to identity cols. We maybe could implement support for those, but it would be an annoying time sink -- even blind appends are affected.

So basically, we can only support writer versions 1 and 7.

Meanwhile, we can actually support all three reader versions, because 2 just requires column mapping which we already support.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might be able to partly work around this because some features have an obvious effect on the table's Metadata when they're active. We can probably get away with allowing the protocol version, as long as we can prove that the table doesn't use any unsupported features?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this does cover all cases. Note it's checking what writerFeatures is set to. So in the top two cases there is a writerFeatures in the protocol. If we're not on version 7, that's an error, if we are on 7, we validate the features as before.

The None case is then interesting. We could actually be on any version:

  • 7 is invalid because then there should be writer features, but there aren't.
  • 1 is valid, and we can write to it
  • all the rest are currently invalid.

I've updated the code to enforce that, but I got the sense DBR might go to (1,2) sometimes, so perhaps we should check and if it's 2 we can look in the metadata configuration and make sure it doesn't contain delta.invariants before continuing.

thoughts?

Copy link
Collaborator

@scovich scovich Feb 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree the cases are covered correctly now.

Re (1, 2) -- agree we need probably need to do that at some point. However, in theory the table only has (1, 2) because it is using invariants; otherwise why did it the table need to upgrade in the first place?

Good candidate for a follow-on PR tho.

Comment on lines 291 to 292
require!(
self.min_writer_version != 7,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't the first match arm guarantee this assertion can never fail?

Copy link
Collaborator Author

@nicklan nicklan Feb 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. This is the case where there was no writerFeatures specified in the protocol. So I've switched this to be a check that we're on version 1 only (but maybe we need 2, see other comment)

Comment on lines +164 to +165
let (store_37, engine_37, table_location_37) = setup("test_table_37", true);
let (store_11, engine_11, table_location_11) = setup("test_table_11", true);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do the different tables need different engines, out of curiosity?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We pass the storage to the engine, so because we want disjoint storage, we need to have two engines

@nicklan nicklan requested a review from scovich February 13, 2025 01:42
Copy link
Collaborator

@zachschuermann zachschuermann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM one nit!

require!(
self.min_writer_version == 1,
Error::unsupported(
"Currently delta-kernel-rs can only write to tables with min_writer_version = 1 or 7"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: just to match spec (or maybe do delta.minWriterVersion to match the user-facing table property?)

Suggested change
"Currently delta-kernel-rs can only write to tables with min_writer_version = 1 or 7"
"Currently delta-kernel-rs can only write to tables with protocol.minWriterVersion = 1 or 7"

@nicklan nicklan changed the title Support writing to not only 3/7 protocol feat: Support writing to not only 3/7 protocol Feb 20, 2025
@github-actions github-actions bot added the breaking-change Change that will require a version bump label Feb 20, 2025
@nicklan nicklan merged commit e2bcd0b into delta-io:main Feb 20, 2025
19 of 21 checks passed
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
breaking-change Change that will require a version bump
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants