Support Boolean Parquet Data Page Statistics #11027
Comments
Sorry for the noise |
Actually, I don't think this is done yet. This ticket covers extracting DataPage statistics (not row group statistics, which are annoyingly different in parquet 🤯). The data page statistics are extracted here: datafusion/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs, lines 612 to 627 in 18042fd
In order to complete this issue, we need to change the test to check: Check::Both, and make the tests pass |
Oh sorry, that was stupid of me. |
No worries at all -- this stuff is tricky |
Yeah, all those similar names do get to me sometimes... On another note, I tried to implement this the same way the other types were implemented, but the test fails with:
The implementation is like this:
make_data_page_stats_iterator!(
    MinBooleanDataPageStatsIterator,
    |x: &PageIndex<bool>| { x.min },
    Index::BOOLEAN,
    bool
);
make_data_page_stats_iterator!(
    MaxBooleanDataPageStatsIterator,
    |x: &PageIndex<bool>| { x.max },
    Index::BOOLEAN,
    bool
);
...
macro_rules! get_data_page_statistics {
    ($stat_type_prefix: ident, $data_type: ident, $iterator: ident) => {
        paste! {
            match $data_type {
                Some(DataType::Boolean) => Ok(Arc::new(
                    BooleanArray::from_iter(
                        [<$stat_type_prefix BooleanDataPageStatsIterator>]::new($iterator).flatten()
                    )
                )),
                ...
}
These macros, functions, and tests jump around a lot before I get to the caller, which causes this panic. Do you or anyone else know why this happens? |
The "iterator must be sized" thing comes from arrow -- one workaround is to collect the values into a Vec first and then create the array. I don't know why boolean is different than the other data page types 🤔 |
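To illustrate the workaround suggested above, here is a minimal std-only sketch. It assumes the panic comes from a constructor that requires an upper bound from `size_hint()`, which a `flatten()` adapter over nested Vecs does not provide. `build_sized`, `sample_pages`, `direct`, and `via_vec` are hypothetical names invented for this illustration; `build_sized` only stands in for the array constructor and is not arrow's actual code.

```rust
/// Hypothetical stand-in for a constructor that needs a known upper bound
/// on the iterator length (an assumption about why the panic occurs).
fn build_sized<I>(iter: I) -> Result<Vec<Option<bool>>, String>
where
    I: IntoIterator<Item = Option<bool>>,
{
    let iter = iter.into_iter();
    match iter.size_hint() {
        // An upper bound is available: pre-allocate and fill.
        (_, Some(upper)) => {
            let mut out = Vec::with_capacity(upper);
            out.extend(iter);
            Ok(out)
        }
        // No upper bound: report the error instead of panicking.
        (_, None) => Err("iterator must be sized".to_string()),
    }
}

/// Made-up per-page values, grouped the way nested statistics might be.
fn sample_pages() -> Vec<Vec<Option<bool>>> {
    vec![vec![Some(false), Some(true)], vec![None, Some(true)]]
}

/// Flattening directly: `Flatten` over nested Vecs reports no upper bound.
fn direct() -> Result<Vec<Option<bool>>, String> {
    build_sized(sample_pages().into_iter().flatten())
}

/// Workaround: collect into a Vec first; a Vec's iterator knows its length.
fn via_vec() -> Result<Vec<Option<bool>>, String> {
    let values: Vec<Option<bool>> = sample_pages().into_iter().flatten().collect();
    build_sized(values)
}
```

In this sketch `direct()` returns the error while `via_vec()` succeeds, at the cost of one intermediate allocation. Whether the same change resolves the panic in the real macro would need to be confirmed against the failing test.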
take |
Is your feature request related to a problem or challenge?
Part of #10922
We are adding APIs to efficiently convert the data stored in Parquet's "PageIndex" into ArrayRefs -- which will make it significantly easier to use this information for pruning and other tasks.
Describe the solution you'd like
Add support to StatisticsConverter::min_page_statistics and StatisticsConverter::max_page_statistics for the types above:
datafusion/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs
Lines 637 to 656 in a923c65
Describe alternatives you've considered
You can follow the model from @Weijun-H in #10931
test_int64
datafusion/datafusion/core/tests/parquet/arrow_statistics.rs
Lines 506 to 529 in a923c65
datafusion/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs
Lines 575 to 586 in 2f43476
datafusion/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs
Line 90 in 2f43476
Additional context
No response