
Adapt column statistics API #717

Open
Dandandan opened this issue Jul 13, 2021 · 2 comments
Labels
enhancement New feature or request

Comments

@Dandandan
Contributor

Dandandan commented Jul 13, 2021

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
While looking at adding support for more statistics to the Delta Lake TableProvider implementation, I bumped into a limitation in our statistics API.

Currently, column_statistics is an Option<Vec<ColumnStatistics>>.

https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/datasource/datasource.rs#L37

So a TableProvider has to return the statistics at the correct index for each column, regardless of the order of the columns in the files.
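To make the positional coupling concrete, here is a minimal sketch. The ColumnStatistics struct below is a simplified stand-in (not the actual DataFusion type), and stats_for is a hypothetical helper: it shows that a consumer can only find a column's statistics through that column's index in a list of field names.

```rust
// Simplified stand-in for illustration; not the actual DataFusion struct.
#[derive(Debug, Clone, PartialEq, Default)]
pub struct ColumnStatistics {
    pub null_count: Option<usize>,
    pub distinct_count: Option<usize>,
}

/// Hypothetical helper: finds the statistics for `name` by looking up the
/// column's position in `schema_fields` and indexing into the vector.
/// This is only correct if `column_statistics` is in the same order.
pub fn stats_for<'a>(
    schema_fields: &[&str],
    column_statistics: &'a [ColumnStatistics],
    name: &str,
) -> Option<&'a ColumnStatistics> {
    let idx = schema_fields.iter().position(|f| *f == name)?;
    column_statistics.get(idx)
}
```

If the provider builds the vector in file order and the file order differs from the field order, the same lookup silently returns the statistics of a different column.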

Describe the solution you'd like
Either:

  • Return a HashMap<String, ColumnStatistics> rather than an Option<Vec<ColumnStatistics>>
  • Pass a Schema parameter to TableProvider::statistics so the positions of the fields can be calculated.

FWIW, Delta Lake / delta-rs takes the first approach, and it seems straightforward to implement and use.
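A rough sketch of how the two options relate, using simplified stand-in types rather than the real DataFusion or delta-rs ones. The function align_to_schema is hypothetical: it takes statistics keyed by column name (the first option) and uses a list of field names, standing in for a Schema, to compute the positional vector (the second option). Falling back to empty statistics for missing columns is an assumption made for this sketch.

```rust
use std::collections::HashMap;

// Simplified stand-in for illustration; not the actual DataFusion struct.
#[derive(Debug, Clone, PartialEq, Default)]
pub struct ColumnStatistics {
    pub null_count: Option<usize>,
}

/// Hypothetical helper: turns name-keyed statistics into a vector ordered
/// like `schema_fields`. Columns without an entry get default (unknown)
/// statistics, so the result always has one element per field.
pub fn align_to_schema(
    schema_fields: &[&str],
    by_name: &HashMap<String, ColumnStatistics>,
) -> Vec<ColumnStatistics> {
    schema_fields
        .iter()
        .map(|name| by_name.get(*name).cloned().unwrap_or_default())
        .collect()
}
```

With a name-keyed map, the order in which the files list their columns no longer matters; the position is derived once, from the schema.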

Describe alternatives you've considered

Additional context

@Dandandan Dandandan added the enhancement New feature or request label Jul 13, 2021
@Dandandan
Contributor Author

Closing, seeing that this could be done with the schema on the table provider instead.

@rdettai
Contributor

rdettai commented Sep 13, 2021

@Dandandan in #965 I used the schema from the ExecutionPlan trait and it worked fine. But I do agree that it might be better to come up with a data structure that helps assert that the column_statistics vector is well aligned with the schema fields vector (same size, same types...). I'm adding this as an item in #997, so if you want to close this for now that's fine by me 😃
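One possible shape for such a data structure, as a sketch only. Every type here (DataType, Scalar, Field, ColumnStatistics, AlignedStatistics) is a simplified, hypothetical stand-in and not an existing DataFusion or Arrow API. The constructor rejects a statistics vector whose length differs from the number of fields, or whose min/max values do not match the field's type.

```rust
// All types below are hypothetical stand-ins for illustration.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum DataType { Int64, Utf8 }

#[derive(Debug, Clone, PartialEq)]
pub enum Scalar { Int64(i64), Utf8(String) }

impl Scalar {
    fn data_type(&self) -> DataType {
        match self {
            Scalar::Int64(_) => DataType::Int64,
            Scalar::Utf8(_) => DataType::Utf8,
        }
    }
}

#[derive(Debug, Clone, PartialEq)]
pub struct Field { pub name: String, pub data_type: DataType }

#[derive(Debug, Clone, PartialEq, Default)]
pub struct ColumnStatistics {
    pub min_value: Option<Scalar>,
    pub max_value: Option<Scalar>,
}

/// Statistics paired with their fields; can only be built through `try_new`.
#[derive(Debug)]
pub struct AlignedStatistics { columns: Vec<(Field, ColumnStatistics)> }

impl AlignedStatistics {
    /// Checks "same size, same types" before pairing fields and statistics.
    pub fn try_new(fields: Vec<Field>, stats: Vec<ColumnStatistics>) -> Result<Self, String> {
        if fields.len() != stats.len() {
            return Err(format!(
                "expected {} column statistics, got {}",
                fields.len(),
                stats.len()
            ));
        }
        for (field, stat) in fields.iter().zip(&stats) {
            // Only known min/max values can be type-checked.
            for value in stat.min_value.iter().chain(stat.max_value.iter()) {
                if value.data_type() != field.data_type {
                    return Err(format!(
                        "statistics for column '{}' have the wrong type",
                        field.name
                    ));
                }
            }
        }
        Ok(Self { columns: fields.into_iter().zip(stats).collect() })
    }

    /// Looks up statistics by column name instead of by position.
    pub fn get(&self, name: &str) -> Option<&ColumnStatistics> {
        self.columns.iter().find(|(f, _)| f.name == name).map(|(_, s)| s)
    }
}
```

A type check like this only catches misalignment between columns of different types; two swapped columns of the same type would still pass, so it complements rather than replaces keying by name.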
