Support DictionaryArray Parquet Data Page Statistics #11185

Closed
Tracked by #10922
alamb opened this issue Jun 30, 2024 · 4 comments · Fixed by #11195
Labels
good first issue Good for newcomers

Comments

@alamb
Contributor

alamb commented Jun 30, 2024

Is your feature request related to a problem or challenge?

Part of #10922

We are adding APIs to efficiently convert the data stored in Parquet's "PageIndex" into ArrayRefs -- which will make it significantly easier to use this information for pruning and other tasks.
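
To illustrate what "pruning" with per-page statistics means, here is a minimal, std-only sketch. It is not DataFusion's pruning code: the function name `pages_to_keep` and the use of plain `Option<i64>` slices in place of Arrow arrays are assumptions made for illustration. Given the min and max of each data page, it decides which pages could contain a value in the range `lo..=hi`.

```rust
/// Hypothetical helper (not a DataFusion API): for each data page, returns
/// `true` if the page may contain a value `x` with `lo <= x <= hi`.
///
/// `mins[i]` and `maxes[i]` are the statistics of page `i`; `None` means the
/// statistic is missing, in which case the page cannot be ruled out.
fn pages_to_keep(
    mins: &[Option<i64>],
    maxes: &[Option<i64>],
    lo: i64,
    hi: i64,
) -> Vec<bool> {
    mins.iter()
        .zip(maxes.iter())
        .map(|(min, max)| {
            // Every value in the page is smaller than the range.
            let entirely_below = matches!(max, Some(m) if *m < lo);
            // Every value in the page is larger than the range.
            let entirely_above = matches!(min, Some(m) if *m > hi);
            !(entirely_below || entirely_above)
        })
        .collect()
}
```

With the page statistics used in the test further down (mins `[-5, -4, 0, 5]`, maxes `[-1, 0, 4, 9]`) and the range `1..=4`, only the third page survives.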

Describe the solution you'd like

Add support for DictionaryArray to StatisticsConverter::min_page_statistics and StatisticsConverter::max_page_statistics

/// Extracts the min statistics from an iterator
/// of parquet page [`Index`]'es to an [`ArrayRef`]
pub(crate) fn min_page_statistics<'a, I>(
    data_type: Option<&DataType>,
    iterator: I,
) -> Result<ArrayRef>
where
    I: Iterator<Item = (usize, &'a Index)>,
{
    get_data_page_statistics!(Min, data_type, iterator)
}

/// Extracts the max statistics from an iterator
/// of parquet page [`Index`]'es to an [`ArrayRef`]
pub(crate) fn max_page_statistics<'a, I>(
    data_type: Option<&DataType>,
    iterator: I,
) -> Result<ArrayRef>
where
    I: Iterator<Item = (usize, &'a Index)>,
{
    get_data_page_statistics!(Max, data_type, iterator)
}
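
One plausible way to support dictionary columns in these two functions is to resolve the dictionary's value type first and then dispatch on that, since min/max statistics describe the values rather than the dictionary keys. The sketch below is std-only and uses a hypothetical, simplified `DataType` enum as a stand-in for Arrow's; the helper name `statistics_type` is also an assumption, not part of DataFusion.

```rust
/// Hypothetical, simplified stand-in for Arrow's `DataType`.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    Int64,
    Utf8,
    /// Dictionary(key type, value type)
    Dictionary(Box<DataType>, Box<DataType>),
}

/// Hypothetical helper: returns the type whose statistics should be
/// extracted. Dictionary wrappers are stripped (recursively) so that the
/// caller can reuse the code path that already exists for the value type.
fn statistics_type(data_type: &DataType) -> &DataType {
    match data_type {
        DataType::Dictionary(_key_type, value_type) => statistics_type(value_type),
        other => other,
    }
}
```

In the real macro this would presumably amount to one extra match arm for `DataType::Dictionary(_, value_type)` that re-enters the same function with `Some(value_type)`.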

Describe alternatives you've considered

You can follow the model from @Weijun-H in #10931

  1. Update the test for the listed data types (I think it is test_binary) following the model of test_int_64

    async fn test_int_64() {
        // This creates a parquet file of 4 columns named "i8", "i16", "i32", "i64"
        let reader = TestReader {
            scenario: Scenario::Int,
            row_per_group: 5,
        }
        .build()
        .await;
        // since each row group has only one data page, the statistics are the same
        Test {
            reader: &reader,
            // mins are [-5, -4, 0, 5]
            expected_min: Arc::new(Int64Array::from(vec![-5, -4, 0, 5])),
            // maxes are [-1, 0, 4, 9]
            expected_max: Arc::new(Int64Array::from(vec![-1, 0, 4, 9])),
            // nulls are [0, 0, 0, 0]
            expected_null_counts: UInt64Array::from(vec![0, 0, 0, 0]),
            // row counts are [5, 5, 5, 5]
            expected_row_counts: UInt64Array::from(vec![5, 5, 5, 5]),
            column_name: "i64",
            check: Check::Both,
        }
        .run();
    }

  2. Add any required implementation in

    make_data_page_stats_iterator!(MinInt64DataPageStatsIterator, min, Index::INT64, i64);
    make_data_page_stats_iterator!(MaxInt64DataPageStatsIterator, max, Index::INT64, i64);

    macro_rules! get_data_page_statistics {
        ($stat_type_prefix: ident, $data_type: ident, $iterator: ident) => {
            paste! {
                match $data_type {
                    Some(DataType::Int64) => Ok(Arc::new(Int64Array::from_iter(
                        [<$stat_type_prefix Int64DataPageStatsIterator>]::new($iterator).flatten(),
                    ))),
                    _ => unimplemented!(),
                }
            }
        };
    }
    (follow the model of the row counts)
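
For readers unfamiliar with the macro pattern above, here is a simplified, std-only sketch of how one macro invocation per statistic and physical type can generate the extraction code. `PageIndex`, `Index`, and `make_page_stats_fn!` are hypothetical stand-ins, not the parquet or DataFusion definitions, and the `usize` in each item is assumed to be the number of pages in that row group.

```rust
/// Hypothetical stand-in for one page's entry in a parquet column index.
struct PageIndex<T> {
    min: Option<T>,
    max: Option<T>,
}

/// Hypothetical stand-in for parquet's `Index` enum (one value per row group).
enum Index {
    Int32(Vec<PageIndex<i32>>),
    Int64(Vec<PageIndex<i64>>),
}

/// Generates a function that flattens one statistic (`min` or `max`) out of
/// every page of every row group, for a single physical type.
macro_rules! make_page_stats_fn {
    ($fn_name:ident, $field:ident, $variant:path, $value_type:ty) => {
        fn $fn_name<'a>(
            iter: impl Iterator<Item = (usize, &'a Index)>,
        ) -> Vec<Option<$value_type>> {
            iter.flat_map(|(page_count, index)| match index {
                // Matching variant: copy the requested field of each page.
                $variant(pages) => pages.iter().map(|p| p.$field).collect::<Vec<_>>(),
                // Any other variant: emit one null per page.
                _ => vec![None; page_count],
            })
            .collect()
        }
    };
}

make_page_stats_fn!(min_int64_page_stats, min, Index::Int64, i64);
make_page_stats_fn!(max_int64_page_stats, max, Index::Int64, i64);
```

Supporting another type in this sketch means adding another pair of invocations; the dictionary case differs in that it can reuse the functions generated for its value type.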

Additional context

No response

@dharanad
Contributor

take

@efredine
Contributor

efredine commented Jul 1, 2024

@dharanad - I have a bit of time today and could pick up the data page stats for this one and/or the FixedSizeByteArray stats to unblock the remaining tasks in this epic but wouldn't bother if you're actively working on them.

@dharanad
Contributor

dharanad commented Jul 1, 2024

Hello @efredine, sure thing. I will unassign myself from this issue; you can pick this one up. I will continue my work on FixedSizeByteArray.

@dharanad dharanad removed their assignment Jul 1, 2024
@efredine
Contributor

efredine commented Jul 1, 2024

take

3 participants