Skip to content

Bug-fix in Filter and Limit statistics #8094

New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Merged
merged 5 commits into from
Nov 9, 2023
Merged

Bug-fix in Filter and Limit statistics #8094

merged 5 commits into from
Nov 9, 2023

Conversation

berkaysynnada
Copy link
Contributor

@berkaysynnada berkaysynnada commented Nov 8, 2023

Which issue does this PR close?

Related to #8078.

Rationale for this change

After enum Precision is introduced in DF, some bugs are discovered. This PR resolves 3 bugs:

  1. singleton check for ColumnStatistics,
  2. limit in datasource with num rows statistics,
  3. min - max exactness of columns with constant value.

What changes are included in this PR?

1st fix: A column is labeled as singleton only if its min and max values are exact and equal.
2nd fix: To stop processing, only exact count of rows is regarded. Otherwise, we should continue to process until range estimation of precision implemented.
3rd fix: During the analysis in filter statistics, if a column is filtered with a constant value (e.g. c=1), we set its min and max values as exact.

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the core Core DataFusion crate label Nov 8, 2023
@berkaysynnada berkaysynnada marked this pull request as draft November 8, 2023 16:15
@berkaysynnada berkaysynnada marked this pull request as ready for review November 8, 2023 17:09
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the subtlety that may be involved here I think we should have test for these changes. Given that no existing test breaks, that suggests to me that it isn't sufficiently covered 🤔

@berkaysynnada
Copy link
Contributor Author

Given the subtlety that may be involved here I think we should have test for these changes. Given that no existing test breaks, that suggests to me that it isn't sufficiently covered 🤔

I have added tests covering the 1st and 3rd cases, but can't find a test suite for the 2nd case that I can demonstrate the need for the fix. Do you have any advice for me to do that easily?

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @berkaysynnada -- I think this is better for sure. I hope to keep improving test coverage as part of #8078

@@ -70,7 +70,11 @@ pub async fn get_statistics_with_limit(
// files. This only applies when we know the number of rows. It also
// currently ignores tables that have no statistics regarding the
// number of rows.
if num_rows.get_value().unwrap_or(&usize::MIN) <= &limit.unwrap_or(usize::MAX) {
let conservative_num_rows = match num_rows {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants