-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Extract Parquet statistics from Interval
column
#10801
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @marvinlanhenke -- this looks very nice 👌 -- I left some comments but I think it is quite close.
Note that there are some conflicts, likely with #10711 which merged a few hours ago
I will file a ticket upstream in arrow-rs to support writing statistics for intervals (thank you for the code pointers)
datafusion/core/src/datasource/physical_plan/parquet/statistics.rs
Outdated
Show resolved
Hide resolved
I filed apache/arrow-rs#5847. I didn't quite understand your suggestion about two tickets |
I was thinking about two tickets:
I think those are two distinct issues? |
@alamb thanks for the review. I have adressed your comments PTAL. |
Yes I think you are right -- any chance you can file a ticket in arrow-rs to track writing IntervalMonthDayNano? |
Sure, I can do that. I filed: apache/arrow-rs#5849 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me -- thank you very much @marvinlanhenke
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm @marvinlanhenke kudos for the test description
As per the format specification, it is incorrect to read statistics for interval columns, as there is no defined sort order - https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#interval |
@alamb |
I think removing it would be fine @marvinlanhenke -- thank you |
* feat: add make_batch + basic test
Which issue does this PR close?
Closes #10752.
Rationale for this change
Since parquet does not support statistics for
Interval
columns, this PR only prepares test cases and some necessary setup.Once parquet supports statistics, a follow-up PR will be needed to finish the implementation.
What changes are included in this PR?
get_statistic
match arm (stub)should_panic
Are these changes tested?
Are there any user-facing changes?
No.