Skip to content

Minor: Update min_statistics and max_statistics to be helpers, update docs #10866

New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Merged
merged 1 commit into from
Jun 12, 2024

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Jun 11, 2024

Which issue does this PR close?

Part of #10453
Inspired by #10852

Rationale for this change

The new StatisticsConverter API is ready, so we can remove the old min_statistics and max_statistics from "crate" level APIs

What changes are included in this PR?

  1. Make the old min_statistics and max_statistics APIs module private
  2. Update module docs

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the core Core DataFusion crate label Jun 11, 2024
Copy link
Contributor

@marvinlanhenke marvinlanhenke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb
Thanks LGTM.

I'm currently thinking about the following, which might be a refactor to consider once we support Data Pages as well?

In order to have less code duplication, we could probably introduce an enum and match in a single get_statistic function (or macro like we already have). WDYT?

enum StatisticPrefix {
    Min,
    Max,
   // ... and the rest (e.g NullCount, RowCount)
}

enum StatisticType<I> {
    RowGroup{iter: I, prefix: StatisticPrefix},
    DataPage{iter: I, prefix: StatisticPrefix},
   // ... do we have more to support?
}

// This would be a macro like we already have in place
fn get_statistics<I: Iterator>(data_type: &DataType, statistic_type: StatisticType<I>) -> Result<ArrayRef>{
    match data_type {
        DataType::Boolean => {
            match statistic_type {
                StatisticType::RowGroup{iter, prefix} => {
                    // create boolean array from iter
                    // using prefix to create e.g. `MinBooleanRowGroupStatsIterator::new(iter)`
                },
                // do the same for StatisticType::DataPage
                _ => unimplemented!()
            }
        }
       // support all the other data_types
        _ => unimplemented!()
    }
} 

Then in the StatisticsConverter we could...

// impl StatisticConverter
// ...

fn row_group_mins<I>(&self, metadatas: I) -> Result<ArrayRef>
where
    I: IntoIterator<Item = &'a RowGroupMetaData>,
{
    let data_type = self.arrow_field.data_type();

    let Some(parquet_index) = self.parquet_index else {
        return Ok(self.make_null_array(data_type, metadatas));
    };

    let iter = metadatas
        .into_iter()
        .map(|x| x.column(parquet_index).statistics());
    
    let statistic_type = StatisticType::RowGroup{iter, prefix: StatisticPrefix::Min};
    
    get_statistics(data_type, statistic_type);
}

Copy link
Member

@jonahgao jonahgao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm thanks @alamb

@comphead comphead merged commit 73381fe into apache:main Jun 12, 2024
23 checks passed
findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants