Bug-fix in Filter and Limit statistics #8094
Given the subtlety that may be involved here, I think we should have tests for these changes. No existing test breaks, which suggests to me that this code isn't sufficiently covered 🤔
I have added tests covering the 1st and 3rd cases, but I can't find a test suite for the 2nd case in which I can demonstrate the need for the fix. Do you have any advice on how to do that easily?
Thank you @berkaysynnada -- I think this is better for sure. I hope to keep improving test coverage as part of #8078
@@ -70,7 +70,11 @@ pub async fn get_statistics_with_limit(
// files. This only applies when we know the number of rows. It also
// currently ignores tables that have no statistics regarding the
// number of rows.
if num_rows.get_value().unwrap_or(&usize::MIN) <= &limit.unwrap_or(usize::MAX) {
let conservative_num_rows = match num_rows {
👍
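To make the reviewed change concrete, here is a minimal, self-contained sketch of the difference between the old comparison and a conservative one. It is an illustration under assumptions, not the PR's code: the `Precision` enum below is a simplified, non-generic stand-in for the DataFusion type, the helper names `old_should_keep_reading` and `should_keep_reading` are hypothetical, and the arms of the `match` are a guess at how the truncated line in the diff continues.

```rust
/// Simplified, non-generic stand-in for DataFusion's `Precision`
/// (an assumption made for this sketch only).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Precision {
    Exact(usize),
    Inexact(usize),
    Absent,
}

/// Old behaviour (hypothetical helper): any known value, exact or not,
/// is compared against the limit.
pub fn old_should_keep_reading(num_rows: Precision, limit: Option<usize>) -> bool {
    let value = match num_rows {
        Precision::Exact(n) | Precision::Inexact(n) => n,
        Precision::Absent => usize::MIN,
    };
    value <= limit.unwrap_or(usize::MAX)
}

/// Conservative behaviour (hypothetical helper): only an exact row count
/// can end processing; inexact or absent counts are treated as zero rows.
pub fn should_keep_reading(num_rows: Precision, limit: Option<usize>) -> bool {
    let conservative_num_rows = match num_rows {
        Precision::Exact(n) => n,
        _ => usize::MIN,
    };
    conservative_num_rows <= limit.unwrap_or(usize::MAX)
}
```

With an inexact count of 1000 rows and a limit of 10, the old check stops reading files even though the estimate may be wrong, while the conservative check keeps going.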
Which issue does this PR close?
Related to #8078.
Rationale for this change
After `enum Precision` was introduced in DF, some bugs were discovered. This PR resolves 3 bugs, including in `ColumnStatistics`.
What changes are included in this PR?
1st fix: A column is labeled as singleton only if its min and max values are exact and equal.
2nd fix: Processing stops early only when the row count is exact. Otherwise, we continue processing, at least until range estimation is implemented for `Precision`.
3rd fix: During the analysis in filter statistics, if a column is filtered with a constant value (e.g. c=1), we set its min and max values as exact.
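The 1st and 3rd fixes can be sketched together as follows. This is a simplified illustration rather than the PR's implementation: `Precision<T>` is a reduced stand-in for the DataFusion enum, and `ColumnStats`, `is_singleton`, and `after_eq_filter` are hypothetical names with `i64` bounds chosen only to keep the example short.

```rust
/// Reduced stand-in for DataFusion's `Precision<T>` (assumption for this sketch).
#[derive(Debug, Clone, PartialEq)]
pub enum Precision<T> {
    Exact(T),
    Inexact(T),
    Absent,
}

/// Hypothetical, minimal column statistics holding only min and max bounds.
#[derive(Debug, Clone, PartialEq)]
pub struct ColumnStats {
    pub min: Precision<i64>,
    pub max: Precision<i64>,
}

impl ColumnStats {
    /// 1st fix: a column is a singleton only if both bounds are exact and equal.
    /// Equal but inexact bounds do not qualify.
    pub fn is_singleton(&self) -> bool {
        match (&self.min, &self.max) {
            (Precision::Exact(min), Precision::Exact(max)) => min == max,
            _ => false,
        }
    }

    /// 3rd fix: after filtering on a constant (e.g. `c = 1`), both bounds
    /// are that constant and can be marked exact.
    pub fn after_eq_filter(value: i64) -> Self {
        ColumnStats {
            min: Precision::Exact(value),
            max: Precision::Exact(value),
        }
    }
}
```

In this model, statistics produced by `after_eq_filter(1)` satisfy `is_singleton()`, whereas a column whose bounds are both `Inexact(1)` does not.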
Are these changes tested?
Are there any user-facing changes?