Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

feat: Add Catalog API #54

Merged
merged 15 commits into from
Sep 21, 2023
Merged

feat: Add Catalog API #54

merged 15 commits into from
Sep 21, 2023

Conversation

Xuanwo
Copy link
Member

@Xuanwo Xuanwo commented Sep 1, 2023

This is a draft of catalog API to show my general ideas:

  • I left the complex TableUpdate for further PRs.
  • All struct only have definition and no function impemented.
  • The detailed behavior is not added yet.
  • This design is mainly modeled by Iceberg REST API with some my own understanding

Once this draft has been approved, we can start fill it will real implementations.

Design Idea

  • Clean
  • Easy to undetstand (both for using and implementing)
  • Optimized for rust developers
  • Async First

Signed-off-by: Xuanwo <github@xuanwo.io>
@Xuanwo
Copy link
Member Author

Xuanwo commented Sep 1, 2023

cc @liurenjie1024 @JanKaul @ZENOTME @Fokko, would you like to take a look?

Copy link
Collaborator

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally LGTM, left small suggestions.

crates/iceberg/src/catalog.rs Outdated Show resolved Hide resolved
crates/iceberg/src/catalog.rs Outdated Show resolved Hide resolved
crates/iceberg/src/catalog.rs Outdated Show resolved Hide resolved
crates/iceberg/src/catalog.rs Outdated Show resolved Hide resolved
crates/iceberg/src/catalog.rs Outdated Show resolved Hide resolved
crates/iceberg/src/catalog.rs Outdated Show resolved Hide resolved
Signed-off-by: Xuanwo <github@xuanwo.io>
Signed-off-by: Xuanwo <github@xuanwo.io>
Signed-off-by: Xuanwo <github@xuanwo.io>
Signed-off-by: Xuanwo <github@xuanwo.io>
Signed-off-by: Xuanwo <github@xuanwo.io>
Copy link
Collaborator

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@JanKaul
Copy link
Collaborator

JanKaul commented Sep 1, 2023

We haven't talked about Iceberg Views yet and I'm not sure if you would like to support them in the future. I'm actually really interested in using views.

I don't want to include view support in this PR but I would like to future proof the catalog design to easily support views in the future.

The REST catalog is in the process of adding view support. However, I have the feeling that the initial REST catalog API wasn't designed with views in mind. There is one issue that comes up with the current design that we might be able to avoid for our catalog design. I will try to explain the issue.

Imagine a query engine wants to perform a query like the following:

SELECT first_name, last_name, age FROM users WHERE age > 18;

Without the catalog information the query engine doesn't know whether users refers to a view or a table. So if the catalog only exposes a load_table and load_view operation, the query engine doesn't know which one to call. I would prefer a load_tablelike (the naming is not important) operation that returns an

enum TableLike {
    Table(Table),
    View(View)
}

All iceberg catalogs except for the REST catalog return a "table metadata location" and it would be easy to distinguish between Tables and Views based on the metadata.

Let me know what you think.

@Xuanwo
Copy link
Member Author

Xuanwo commented Sep 1, 2023

Thank you for sharing! The relationship between view and table is quite interesting. At Databend, we consider a view as a special type of table, and our engine can distinguish the differences and determine how to execute the SQL accordingly.

I'm wondering if it's possible to avoid exposing the view details at the Catalog level and instead leave them for the catalog implementer to decide.

@JanKaul
Copy link
Collaborator

JanKaul commented Sep 1, 2023

Converting from Table to TableLike is straightforward:

impl From<Table> for TableLike {
    fn from(value: Table) -> Self {
        TableLike::Table(value)
    }
}

So you would only need to call table.into().

For the table scan operation you would need to check whether your input is a table or a view and then execute the corresponding scan operator. As long as we don't have view support it could look like:

let table = if let TableLike::Table(table) = tablelike {
        Ok(table)
    } else {
        Err(Error::new(... , "Views are not supported yet"))
    }?;

@Xuanwo Xuanwo mentioned this pull request Sep 1, 2023
@Xuanwo
Copy link
Member Author

Xuanwo commented Sep 1, 2023

Hi, @JanKaul, I started an issue about this topic: #55

@liurenjie1024
Copy link
Collaborator

Hi, @JanKaul Good point for view. Except for view, I think there will be other entities like materialized view. So the question is about name resolution:

  1. Should we keep separate method for resolving tables, views, event materialized view?
fn load_table(&self, table_name: TableId) -> Result<Table>;
fn load_view(&self, view_name: ViewId) -> Result<View>;

I think it's still useful to keep separate methods for each kind of entity. For example when I want to execute dml like:

insert into t values(1), (2)

The query engine knows that it should load a table to check it.

  1. Should we have a method for resolving entities without knowing its entity type?

As @JanKaul said it's would be quite useful in query:

select * from t;

In this case the catalog api should provide sth like:

fn load_entity(&self, table: &TableId) -> Result<TableLike>
  1. Should we keep both methods or just load_entity?

Keep load_entity method only can solve problem in 1. But I still want to keep methods in 1 for several reasons:

  • The api is more user friendly.
  • This gives concrete implementation more chances to do optimization, especially in cases where many tables and views are defined.

@Xuanwo
Copy link
Member Author

Xuanwo commented Sep 1, 2023

Hi @liurenjie1024, I apologize for the confusion. I have redirected this discussion to #55.

Co-authored-by: Renjie Liu <liurenjie2008@gmail.com>
}

/// TableCreation represents the creation of a table in the catalog.
pub struct TableCreation {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a rustee 🦀 , so forgive me if this is a silly question. Why would you create a struct for this, and not just have an argument to create_table for each of the fields?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's more of a convention in Rust to pass values via structs.

Firstly, Rust structs have zero cost. Therefore, it is exactly the same for Rust to pass values via op.do(struct Abc {a, b}) or op.do(a, b). Additionally, unpacking a struct is also zero cost. Implementors can unpack the value from the struct when needed.

Considering all these reasons, I prefer to pass arguments in a struct to make it more readable and maintainable. This implementation will align with our design ideas:

  • Clean
  • Easy to undetstand (both for using and implementing)
  • Optimized for rust developers

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way, adding a new argument to a trait function is a breaking change. However, adding a new field to a struct can be compatible if it is given a default value.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also agree that we should use struct as method arguments rather than several field to avoid breaking changes when we need to add more arguments.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But it's better to provide a builder for argument?

Copy link
Member Author

@Xuanwo Xuanwo Sep 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But it's better to provide a builder for argument?

Yep, I plan to leave them in following PRs. I believe they are not conflicts with this PR.

crates/iceberg/src/table.rs Outdated Show resolved Hide resolved
Xuanwo and others added 5 commits September 14, 2023 12:25
Signed-off-by: Xuanwo <github@xuanwo.io>
Co-authored-by: Fokko Driesprong <fokko@apache.org>
Signed-off-by: Xuanwo <github@xuanwo.io>
Signed-off-by: Xuanwo <github@xuanwo.io>
Signed-off-by: Xuanwo <github@xuanwo.io>
@Fokko Fokko merged commit 13281d3 into apache:main Sep 21, 2023
6 checks passed
@Fokko
Copy link
Contributor

Fokko commented Sep 21, 2023

Thanks for working on this @Xuanwo and @JanKaul, @liurenjie1024 and @ZENOTME for the review 🙌

@zeodtr
Copy link

zeodtr commented Sep 22, 2023

@Xuanwo Hi,
Maybe it's too early, but I am trying to implement the Catalog API provided with this merge for my Iceberg-related Rust project.
But visibility issues prevent me from doing it.

For example, NamespaceIdent's field is private, so I cannot create a value of that type. In fact, almost all structs in catalog.rs have this visibility issue.

Since this merge is a draft, I can assume that catalog.rs is incomplete yet.
What I want to know is, how will you resolve this visibility issue so that I can pre-modify the source code for the (test-) implementation.

I guess there are two ways (BTW, I'm not fluent with Rust):

  1. Add public new functions and accessor functions.
  2. Make the fields public.
    I think that for trivial structs like the ones in catalog.rs it would be simpler to go to the way 2.

What is your plan?
Or, maybe I misunderstood something fundamental...

Thank you in advance.

@Xuanwo
Copy link
Member Author

Xuanwo commented Sep 22, 2023

Since this merge is a draft, I can assume that catalog.rs is incomplete yet.

Yep, this PR is just to discuss the API with community first.

  • Add public new functions and accessor functions.
  • Make the fields public.
    I think that for trivial structs like the ones in catalog.rs it would be simpler to go to the way 2.

I believe it is safe to make the fields public as users construct them directly. I will submit a PR for this change to be reviewed in due course. Feel free to join the discussion.

@zeodtr
Copy link

zeodtr commented Sep 26, 2023

@Xuanwo Hi,
I'm not sure, but wouldn't it be more natural for Catalog to manage the location of the table and the table metadata itself?
If so, the location field of the TableCreation struct would better be deleted, or at least made Optional.
(For now, I'm thinking about ignoring that field when test-implementing the function.)

Thanks.

@Xuanwo
Copy link
Member Author

Xuanwo commented Sep 26, 2023

at least made Optional.

Sounds like a good idea to me.

However, this PR has already been merged. What do you think about creating a new issue to initiate a discussion? I believe we can refine the catalog API in subsequent PRs.

@zeodtr
Copy link

zeodtr commented Sep 26, 2023

at least made Optional.

Sounds like a good idea to me.

However, this PR has already been merged. What do you think about creating a new issue to initiate a discussion? I believe we can refine the catalog API in subsequent PRs.

It would be nice to create a new issue for this.
But the scope of the issue should be decided. Should it be only for the location field or more general (for example, including member visibility issue)?
Perhaps you can decide the scope and create a new issue for that.

Thanks.

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants