Support Boolean Parquet Data Page Statistics #11027

Closed
Tracked by #10922
alamb opened this issue Jun 20, 2024 · 8 comments · Fixed by #11054
Labels: enhancement (New feature or request), good first issue (Good for newcomers)

Comments

@alamb
Contributor

alamb commented Jun 20, 2024

Is your feature request related to a problem or challenge?

Part of #10922

We are adding APIs to efficiently convert the data stored in Parquet's "PageIndex" into ArrayRefs -- which will make it significantly easier to use this information for pruning and other tasks.
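As a rough illustration of why per-page min/max values help with pruning, here is a minimal std-only sketch. `PageStats`, `page_may_contain`, and `pages_to_scan` are hypothetical names invented for this example and are not part of the parquet or DataFusion APIs; the sketch only assumes that each data page carries an optional min and max.

```rust
/// Hypothetical, simplified stand-in for the min/max of one data page
/// (not a real `parquet` crate type).
#[derive(Debug, Clone, Copy)]
struct PageStats {
    min: Option<bool>,
    max: Option<bool>,
}

/// A page may contain `value` unless its statistics rule it out.
/// Missing statistics are treated conservatively: the page must be scanned.
fn page_may_contain(page: &PageStats, value: bool) -> bool {
    let above_min = page.min.map_or(true, |min| min <= value);
    let below_max = page.max.map_or(true, |max| value <= max);
    above_min && below_max
}

/// One flag per page: `true` means "scan it", `false` means "safe to skip".
fn pages_to_scan(pages: &[PageStats], value: bool) -> Vec<bool> {
    pages.iter().map(|p| page_may_contain(p, value)).collect()
}

fn main() {
    let pages = [
        PageStats { min: Some(false), max: Some(false) }, // all false
        PageStats { min: Some(false), max: Some(true) },  // mixed
        PageStats { min: None, max: None },               // no statistics
    ];
    // Looking for rows where the column is `true`: only the first page is skipped.
    println!("{:?}", pages_to_scan(&pages, true)); // prints [false, true, true]
}
```

The conservative handling of missing statistics is the important design choice: pruning may only skip a page when the statistics prove no row can match.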

Describe the solution you'd like

Add support to StatisticsConverter::min_page_statistics and StatisticsConverter::max_page_statistics for the Boolean type

/// Extracts the min statistics from an iterator
/// of parquet page [`Index`]'es to an [`ArrayRef`]
pub(crate) fn min_page_statistics<'a, I>(
    data_type: Option<&DataType>,
    iterator: I,
) -> Result<ArrayRef>
where
    I: Iterator<Item = (usize, &'a Index)>,
{
    get_data_page_statistics!(Min, data_type, iterator)
}

/// Extracts the max statistics from an iterator
/// of parquet page [`Index`]'es to an [`ArrayRef`]
pub(crate) fn max_page_statistics<'a, I>(
    data_type: Option<&DataType>,
    iterator: I,
) -> Result<ArrayRef>
where
    I: Iterator<Item = (usize, &'a Index)>,
{
    get_data_page_statistics!(Max, data_type, iterator)
}

Describe alternatives you've considered

You can follow the model from @Weijun-H in #10931

  1. Update the test for the Boolean data type following the model of test_int_64
    async fn test_int_64() {
        // This creates a parquet file of 4 columns named "i8", "i16", "i32", "i64"
        let reader = TestReader {
            scenario: Scenario::Int,
            row_per_group: 5,
        }
        .build()
        .await;

        // since each row group has only one data page, the statistics are the same
        Test {
            reader: &reader,
            // mins are [-5, -4, 0, 5]
            expected_min: Arc::new(Int64Array::from(vec![-5, -4, 0, 5])),
            // maxes are [-1, 0, 4, 9]
            expected_max: Arc::new(Int64Array::from(vec![-1, 0, 4, 9])),
            // nulls are [0, 0, 0, 0]
            expected_null_counts: UInt64Array::from(vec![0, 0, 0, 0]),
            // row counts are [5, 5, 5, 5]
            expected_row_counts: UInt64Array::from(vec![5, 5, 5, 5]),
            column_name: "i64",
            check: Check::Both,
        }
        .run();
    }
  2. Add any required implementation in
    make_data_page_stats_iterator!(MinInt64DataPageStatsIterator, min, Index::INT64, i64);
    make_data_page_stats_iterator!(MaxInt64DataPageStatsIterator, max, Index::INT64, i64);

    macro_rules! get_data_page_statistics {
        ($stat_type_prefix: ident, $data_type: ident, $iterator: ident) => {
            paste! {
                match $data_type {
                    Some(DataType::Int64) => Ok(Arc::new(Int64Array::from_iter([<$stat_type_prefix Int64DataPageStatsIterator>]::new($iterator).flatten()))),
                    _ => unimplemented!()
                }
            }
        }
    }
    (follow the model of the row counts)
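The two steps above (a typed per-page iterator plus a match arm per data type) can be modelled without macros. The following is a minimal std-only sketch, not the DataFusion implementation: `PageIndex`, `Index`, `DataType`, `Array`, `Stat`, `pick`, and `page_statistics` are simplified, hypothetical stand-ins, and reading the `usize` in each item as a page count is an assumption.

```rust
// Hypothetical, simplified stand-ins for the parquet/arrow types named in
// this issue. None of these are the real library definitions.

/// Min/max statistics for a single data page.
#[derive(Clone, Copy)]
struct PageIndex<T> {
    min: Option<T>,
    max: Option<T>,
}

/// A typed column index: one variant per physical type.
enum Index {
    Int64(Vec<PageIndex<i64>>),
    Boolean(Vec<PageIndex<bool>>),
}

#[derive(Clone, Copy)]
enum Stat {
    Min,
    Max,
}

#[derive(Clone, Copy)]
enum DataType {
    Int64,
    Boolean,
}

/// Stand-in for a dynamically typed `ArrayRef`.
#[derive(Debug, PartialEq)]
enum Array {
    Int64(Vec<Option<i64>>),
    Boolean(Vec<Option<bool>>),
}

/// Selects the requested statistic from one page.
fn pick<T: Copy>(page: &PageIndex<T>, stat: Stat) -> Option<T> {
    match stat {
        Stat::Min => page.min,
        Stat::Max => page.max,
    }
}

/// Each item is `(page_count, index)` (assumed meaning of the `usize`).
/// An index of the wrong type contributes `page_count` nulls so the output
/// stays aligned with the pages.
fn page_statistics<'a, I>(stat: Stat, data_type: DataType, iterator: I) -> Array
where
    I: Iterator<Item = (usize, &'a Index)>,
{
    match data_type {
        DataType::Int64 => Array::Int64(
            iterator
                .flat_map(|(len, index)| match index {
                    Index::Int64(pages) => {
                        pages.iter().map(|p| pick(p, stat)).collect::<Vec<_>>()
                    }
                    _ => vec![None; len],
                })
                .collect(),
        ),
        DataType::Boolean => Array::Boolean(
            iterator
                .flat_map(|(len, index)| match index {
                    Index::Boolean(pages) => {
                        pages.iter().map(|p| pick(p, stat)).collect::<Vec<_>>()
                    }
                    _ => vec![None; len],
                })
                .collect(),
        ),
    }
}

fn main() {
    let bools = Index::Boolean(vec![
        PageIndex { min: Some(false), max: Some(true) },
        PageIndex { min: Some(true), max: Some(true) },
    ]);
    let ints = Index::Int64(vec![PageIndex { min: Some(-5), max: Some(-1) }]);
    let input: [(usize, &Index); 2] = [(2, &bools), (1, &ints)];

    let mins = page_statistics(Stat::Min, DataType::Boolean, input.iter().copied());
    let maxes = page_statistics(Stat::Max, DataType::Int64, input.iter().copied());
    println!("{:?}", mins);  // prints Boolean([Some(false), Some(true), None])
    println!("{:?}", maxes); // prints Int64([None, None, Some(-1)])
}
```

Supporting a new type in this model means adding one `Index` variant and one match arm, which mirrors the shape of the change requested in this issue.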

Additional context

No response

@LorrensP-2158466
Contributor

The Boolean type is already supported; it was added in commit 9845e6e of PR #10711

@alamb alamb closed this as completed Jun 20, 2024
@alamb
Contributor Author

alamb commented Jun 20, 2024

Sorry for the noise

@alamb
Contributor Author

alamb commented Jun 21, 2024

Actually, I don't think this is done

This ticket covers extracting DataPage statistics (not row group statistics, which are annoyingly different in parquet 🤯 )

The data page statistics are extracted here

macro_rules! get_data_page_statistics {
    ($stat_type_prefix: ident, $data_type: ident, $iterator: ident) => {
        paste! {
            match $data_type {
                Some(DataType::UInt8) => Ok(Arc::new(
                    UInt8Array::from_iter(
                        [<$stat_type_prefix Int32DataPageStatsIterator>]::new($iterator)
                            .map(|x| {
                                x.into_iter().filter_map(|x| {
                                    x.and_then(|x| u8::try_from(x).ok())
                                })
                            })
                            .flatten()
                    )
                )),
                Some(DataType::UInt16) => Ok(Arc::new(
                ...

In order to complete this issue, we need to change the check field of the boolean test to

    check: Check::Both,

and make the tests pass

@alamb alamb reopened this Jun 21, 2024
@LorrensP-2158466
Contributor

Oh sorry, that was stupid of me.

@alamb
Contributor Author

alamb commented Jun 21, 2024

Oh sorry, that was stupid of me.

No worries at all -- this stuff is tricky

@LorrensP-2158466
Contributor

Yeah, all those similar names do get to me sometimes...

On another note, I tried to implement this the same way as the other types, but the test fails with:

thread 'parquet::arrow_statistics::test_boolean' panicked at src/array/boolean_array.rs:407:33:
Iterator must be sized

The implementation is like this:

make_data_page_stats_iterator!(
    MinBooleanDataPageStatsIterator,
    |x: &PageIndex<bool>| { x.min },
    Index::BOOLEAN,
    bool
);
make_data_page_stats_iterator!(
    MaxBooleanDataPageStatsIterator,
    |x: &PageIndex<bool>| { x.max },
    Index::BOOLEAN,
    bool
);
...
macro_rules! get_data_page_statistics {
    ($stat_type_prefix: ident, $data_type: ident, $iterator: ident) => {
        paste! {
            match $data_type {
                Some(DataType::Boolean) => Ok(Arc::new(
                    BooleanArray::from_iter(
                        [<$stat_type_prefix BooleanDataPageStatsIterator>]::new($iterator).flatten()
                    )
                )),
       ...
}

These macros, functions, and tests jump around a lot before I get to the caller, which causes this panic. Do you or anyone else know why this happens?

@alamb
Contributor Author

alamb commented Jun 21, 2024

The "Iterator must be sized" panic comes from arrow -- one workaround is to collect the values into a Vec first and then create the array from that Vec

I don't know why boolean is different than the other data page types 🤔
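A minimal std-only sketch of the workaround follows. It assumes, based only on the panic message, that the array constructor needs an upper bound from `Iterator::size_hint`; `build_sized` is a hypothetical stand-in for such a constructor and is not an arrow API. In the standard library, `flatten()` over inner `Vec`s reports no upper bound while outer items remain, whereas an iterator over a collected `Vec` reports its exact length.

```rust
/// Hypothetical consumer that refuses iterators without a known upper bound,
/// mirroring the "Iterator must be sized" message (not an arrow function).
fn build_sized<I>(iter: I) -> Result<Vec<Option<bool>>, &'static str>
where
    I: Iterator<Item = Option<bool>>,
{
    let (_, upper) = iter.size_hint();
    let len = upper.ok_or("Iterator must be sized")?;
    let mut out = Vec::with_capacity(len);
    out.extend(iter);
    Ok(out)
}

fn main() {
    // One inner Vec per column index, one entry per data page.
    let per_index: Vec<Vec<Option<bool>>> =
        vec![vec![Some(false), Some(true)], vec![None]];

    // `flatten()` cannot know the total length before visiting every inner Vec,
    // so its upper bound is `None` here and the sized consumer rejects it.
    let flattened = per_index.clone().into_iter().flatten();
    assert_eq!(flattened.size_hint().1, None);
    assert!(build_sized(flattened).is_err());

    // Workaround: collect into a Vec first; its iterator knows its exact length.
    let collected: Vec<Option<bool>> = per_index.into_iter().flatten().collect();
    assert_eq!(collected.iter().size_hint(), (3, Some(3)));
    assert_eq!(
        build_sized(collected.into_iter()),
        Ok(vec![Some(false), Some(true), None])
    );
}
```

The extra `collect` costs one intermediate allocation, which seems acceptable here since there is only one value per data page.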

@LorrensP-2158466
Contributor

take
