Prune pages are all null in ParquetExec by row_counts and fix NOT NULL prune #10051
Conversation
None
// see https://github.com/apache/arrow-rs/blob/91f0b1771308609ca27db0fb1d2d49571b3980d8/parquet/src/file/metadata.rs#L979-L982
let mut first_row_index = self
    .col_offset_indexes
I think you can do this same calculation without allocating intermediate Vecs -- something like this (a fuller standalone sketch follows below):
let row_count_per_page = self
    .col_offset_indexes
    .windows(2)
    .map(|location| Some(location[1].first_row_index - location[0].first_row_index));
Some(Arc::new(Int64Array::from_iter(row_count_per_page)))
🤔 the name col_offset_indexes is somewhat confusing to me, as they are PageLocations:
col_offset_indexes: &'a Vec<PageLocation>,
maybe we could rename that field to page_locations 🤔
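For reference, here is a standalone sketch of the windows-based calculation above, using the suggested page_locations naming. This is only an illustration: the PageLocation struct below is a simplified stand-in for the parquet crate's type, and the helper name is made up for this sketch.

use std::sync::Arc;

use arrow::array::Int64Array;

/// Simplified stand-in for parquet's PageLocation; only first_row_index matters here.
struct PageLocation {
    first_row_index: i64,
}

/// Row count of every page except the last, computed from adjacent page offsets
/// without allocating an intermediate Vec. The last page's count needs the row
/// group's total row count, which this sketch does not model.
fn row_counts_per_page(page_locations: &[PageLocation]) -> Arc<Int64Array> {
    let row_count_per_page = page_locations
        .windows(2)
        .map(|w| Some(w[1].first_row_index - w[0].first_row_index));
    Arc::new(Int64Array::from_iter(row_count_per_page))
}

fn main() {
    // Three pages starting at rows 0, 5 and 10: the first two pages hold 5 rows each.
    let pages = vec![
        PageLocation { first_row_index: 0 },
        PageLocation { first_row_index: 5 },
        PageLocation { first_row_index: 10 },
    ];
    let counts = row_counts_per_page(&pages);
    assert_eq!(counts.value(0), 5);
    assert_eq!(counts.value(1), 5);
}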
Thanks @Ted-Jiang -- this is looking good. I think it could be merged as is, though I also have several suggestions that I think will make the code and comments better. We can do those in a follow-on PR though.
@@ -824,9 +863,17 @@ fn create_data_batch(scenario: Scenario) -> Vec<RecordBatch> {
}
Scenario::WithNullValues => {
Could you please also add the "some null/some not null" null values case to this scenario as well?
make_int_batches_with_null(5, 1, 6),
}
Scenario::WithNullValuesPageLevel => {
    vec![
        make_int_batches_with_null(5, 1, 6),
I think this scenario should have at least one batch of all nulls (e.g. make_int_batches_with_null(5, 0, 0)).
I think that case will already be pruned earlier, at the row group -> column chunk level.
no_null_values_start: usize,
no_null_values_end: usize,
I think you might be able to simplify this method by passing in 2 parameters. Right now it looks like it interleaves nulls arbitrarily, but really the nulls are always at the start and the non-nulls are at the end:
num_nulls: usize,
non_nulls: usize
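As an illustration, here is a minimal sketch of a helper with that simplified signature. The single nullable Int32 column named "i" is an assumption for this sketch; the actual test helper's schema may differ.

use std::sync::Arc;

use arrow::array::{ArrayRef, Int32Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

/// Builds a batch whose first num_nulls rows are null and whose remaining
/// non_nulls rows hold the values 0..non_nulls, so nulls always come first.
fn make_int_batches_with_null(num_nulls: usize, non_nulls: usize) -> RecordBatch {
    let schema = Arc::new(Schema::new(vec![Field::new("i", DataType::Int32, true)]));
    let values = (0..num_nulls)
        .map(|_| None)
        .chain((0..non_nulls).map(|v| Some(v as i32)));
    let array: ArrayRef = Arc::new(Int32Array::from_iter(values));
    RecordBatch::try_new(schema, vec![array]).expect("valid batch")
}

With this shape, make_int_batches_with_null(5, 0) would produce an all-null batch and make_int_batches_with_null(0, 5) an all-valid one.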
// row_group4: page1(1~5), page2(All Null)
// total 40 rows
async fn test_pages_with_null_values() {
    test_prune(
It took me a while to convince myself that this was actually setting up the scenario as described. I eventually found it here:
I wonder if it would be possible to add some better comments to test_prune mentioning that the created parquet files have 5 rows per page.
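One possible shape for such a comment, as a sketch only; the real test_prune has more parameters and its setup is not reproduced here, and the 5-rows-per-page figure comes from the discussion above.

/// Runs `query` against the test parquet data and asserts the expected
/// number of rows and pruned pages.
///
/// Note: the parquet files created for these tests are written so that each
/// data page holds at most 5 rows; e.g. a 10-row row group spans 2 pages, as
/// in the row_group4 layout shown above, which is what makes page-level
/// pruning observable.
async fn test_prune(query: &str, expected_rows: usize) {
    // Body elided in this sketch; the real test executes the query and
    // compares pruning metrics against expectations.
    let _ = (query, expected_rows);
}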
I will make another PR to improve these tests.
Thanks again @Ted-Jiang -- I took another look through this PR and it is very nice 👌
Prune pages are all null in ParquetExec by row_counts and fix NOT NULL prune (apache#10051)

* Prune pages are all null in ParquetExec by row_counts and fix NOT NULL prune
* fix clippy
* Update datafusion/core/src/physical_optimizer/pruning.rs
  Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* Update datafusion/core/tests/parquet/page_pruning.rs
  Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* Update datafusion/core/tests/parquet/page_pruning.rs
  Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* Update datafusion/core/tests/parquet/page_pruning.rs
  Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* Update datafusion/core/tests/parquet/page_pruning.rs
  Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* remove allocate vec
* better way avoid allocate vec
* simply expr

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Which issue does this PR close?
Closes #9961
Closes #10049.
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?