Skip to content

Prune pages are all null in ParquetExec by row_counts and fix NOT NULL prune #10051

New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Merged
merged 10 commits into from
Apr 14, 2024

Conversation

Ted-Jiang
Copy link
Member

@Ted-Jiang Ted-Jiang commented Apr 12, 2024

and fix NOT NULL prune

Which issue does this PR close?

Closes #9961
Closes #10049.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@Ted-Jiang Ted-Jiang requested a review from alamb April 12, 2024 07:13
@github-actions github-actions bot added the core Core DataFusion crate label Apr 12, 2024
None
// see https://github.com/apache/arrow-rs/blob/91f0b1771308609ca27db0fb1d2d49571b3980d8/parquet/src/file/metadata.rs#L979-L982
let mut first_row_index = self
.col_offset_indexes
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can do this same calculation without allocating intermediate vec s-- something like this:

        let row_count_per_page = self
            .col_offset_indexes
            .windows(2)
            .map(|location| Some(location[1].first_row_index - location[0].first_row_index));

        Some(Arc::new(Int64Array::from_iter(row_count_per_page)))

🤔 the name col_offset_indexes is somewhat confusing to me as they are PageLocations --

    col_offset_indexes: &'a Vec<PageLocation>,

maybe we could rename that field to page_locations 🤔

Copy link
Member Author

@Ted-Jiang Ted-Jiang Apr 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb thanks to your always conscientious review and kindly suggestions . 👍

Avoid allocate in b2d3641

In my mind page_locations store the offsets in the column chunk , i think they are the same meaning 🤣

@Ted-Jiang Ted-Jiang changed the title Prune pages are all null in ParquetExec by row_counts Prune pages are all null in ParquetExec by row_counts and fix NOT NULL prune Apr 12, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Ted-Jiang -- this is looking good. I think it could be merged as is, though I also have several suggestions that I think will make the code and comments better. We can do so as a follow on PR though.

@@ -824,9 +863,17 @@ fn create_data_batch(scenario: Scenario) -> Vec<RecordBatch> {
}
Scenario::WithNullValues => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please also add the "some null/some not null" null values case to this scenario as well?

                make_int_batches_with_null(5, 1, 6),

}
Scenario::WithNullValuesPageLevel => {
vec![
make_int_batches_with_null(5, 1, 6),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this scenario should have at least one batch of all nulls (e.g. make_int_batches_with_nulls(5, 0, 0))

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this will be prune by row group -> column level

Comment on lines +640 to +641
no_null_values_start: usize,
no_null_values_end: usize,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you might be able to simplify this method by sending in 2 parameters. Right now it looks like it interleaves nulls arbitrarily but really the nulls are always at the start and non nulls are at the end

  num_nulls: usize,
  non_nulls: usize

// row_group4: page1(1~5), page2(All Null)
// total 40 rows
async fn test_pages_with_null_values() {
test_prune(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It took me a while to convince myself that this was actually setting up the scenario as described. I eventually found it here:

https://github.com/alamb/arrow-datafusion/blob/dee926519030301f052dc2c3196e4fbef0da4c47/datafusion/core/tests/parquet/mod.rs#L916-L920

I wonder if it would be possible to add some better comments to test_prune mentioning that the created parquet files have 5 rows per page

Copy link
Member Author

@Ted-Jiang Ted-Jiang Apr 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i will make another pr to improve this tests

Ted-Jiang and others added 8 commits April 13, 2024 18:34
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
@Ted-Jiang Ted-Jiang merged commit 671cef8 into apache:main Apr 14, 2024
@alamb
Copy link
Contributor

alamb commented Apr 15, 2024

Thanks again @Ted-Jiang -- I took another look through this PR and it is very nice 👌

appletreeisyellow pushed a commit to influxdata/arrow-datafusion that referenced this pull request Apr 24, 2024
…L prune (apache#10051)

* Prune pages are all null in ParquetExec by row_counts
and fix NOT NULL prune

* fix clippy

* Update datafusion/core/src/physical_optimizer/pruning.rs

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* Update datafusion/core/tests/parquet/page_pruning.rs

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* Update datafusion/core/tests/parquet/page_pruning.rs

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* Update datafusion/core/tests/parquet/page_pruning.rs

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* Update datafusion/core/tests/parquet/page_pruning.rs

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* remove allocate vec

* better way avoid allocate vec

* simply expr

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
core Core DataFusion crate
Projects
None yet
2 participants