-
Notifications
You must be signed in to change notification settings - Fork 209
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
[WIP] Add ManifestEvaluator
to allow filtering of files in a table scan (Issue #152)
#241
Conversation
Hi, @sdd Thanks for this pr. The filtering process in fact consists of two steps:
My suggestion is to finish ManifestEvalVisitor first since it's relative easy and has no dependency. |
Sounds good @liurenjie1024 - will do. |
a066242
to
b89648d
Compare
crates/iceberg/src/expr/predicate.rs
Outdated
@@ -599,6 +635,14 @@ impl Display for BoundPredicate { | |||
} | |||
} | |||
|
|||
pub(crate) trait PredicateVisitor<T> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think these are needed anymore.
@liurenjie1024 @Xuanwo @ZENOTME: I've updated this PR, details in the updated description at the top. It builds on the There are a few issues that I could do with some help on how best to get them resolved, mentioned in the "Unsolved Problems" section of the description at the top if this PR. |
PartitionEvaluator
to allow filtering of files in a table scan (Issue #152)ManifestEvaluator
to allow filtering of files in a table scan (Issue #152)
crates/iceberg/src/scan.rs
Outdated
AlwaysTrue => true, | ||
BoundPredicate::AlwaysFalse => false, | ||
BoundPredicate::And(expr) => { | ||
self.visit(expr.inputs()[0], partitions) && self.visit(expr.inputs()[1], partitions) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can have something like fn eval(&self, Vec<Literal>)->bool
for the Predicate
.
Issue: 152
@liurenjie1024 are you able to take a look at this again please when you have time? I've resolved the outstanding issues with binding and added accessors to address the missing functionality. |
… rewrite_not call
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sdd
Thanks again for your work. I really like the overall structure and the work you've already done here! However, I left some comments about some design decisions, where I think we can improve - or at least I'm unsure if there isn't a better way.
crates/iceberg/src/scan.rs
Outdated
partition_filter: Predicate, | ||
case_sensitive: bool, | ||
) -> crate::Result<Self> { | ||
let partition_filter = partition_filter.bind(partition_schema.clone(), case_sensitive)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we have rewrite_not
already in place to be applied to partition_filter?
crates/iceberg/src/scan.rs
Outdated
.unwrap(); | ||
/// Evaluates manifest files to see if their partition values comply with a filter predicate | ||
pub struct PartitionEvaluator { | ||
manifest_eval_visitor: ManifestEvalVisitor, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should nest ManifestEvalVisitor
inside PartitionEvaluator
since both structs have separate responsibilities and could exists independently (I shouldn't need a ManifestEvalVisitor to do partition pruning).
Also I think the implementation of PartitionEvaluator
is missing [reference]?
It should visit each DataFile
s struct and evaluate the predicate.
crates/iceberg/src/scan.rs
Outdated
.or_insert_with_key(|key| self.create_partition_evaluator(key, filter)); | ||
|
||
// reject any manifest files whose partition values don't match the filter. | ||
if !partition_evaluator.filter_manifest_file(entry) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should apply ManifestEvaluator
independently from PartitionEvaluator
(see comment down below in struct PartitionEvaluator).
Also, I'm not sure if we can simply 'skip' the file with continue. In the python impl the Partition- and MetricsEvaluator are applied on each DataFile on multiple threads - so I believe we have to collect all relevant manifest files before and then apply the Partition- and MetricsEvaluator on the complete (pre-filtered) collection, in order to split the work into multiple threads?
@liurenjie1024 @ZENOTME @sdd @Xuanwo I took a closer look at the work @sdd has already done - and I think in order to proceed it would make sense to split the I am thinking about putting each visitor into its own file within This way, we could create individual issues on each implementation - and work better in parallel. Perhaps, it also makes sense to provide a trait to make sure each visitor adheres to the same interface // Pseudo-Example
pub trait BooleanVisitor:
fn eval() -> boolean |
Thanks for the review @marvinlanhenke ! I'll address the comments today or tomorrow morning 😁 |
Also I've got some uncommitted changes around having a boolean evaluator trait, similar to your suggestion. It has default implementations for and / or / not / always true / always false, but requires the others to be implemented by any struct implementing the trait. |
+1 for splitting these into split modules. Instead of a boolean visitor, I'm thinking about a post order predicate visitor: pub trait BoundPredicateVisitor {
} which is somehow motivated by @viirya 's pr, but move the travsering flow out of trait body. |
@liurenjie1024 |
I was thinking about following following structure: pub trait BoundPredicateVisitor {
type T;
fn visit_and(&mut self, values: [Self::T; 2]) -> Result<Self::T>;
fn visit_or(&mut self, values: [Self::T; 2]) -> Result<Self::T>;
...
}
pub fn visit_bound_predicate(visitor: &mut V, predicate: &BoundPredicate) -> Result<V::T> {
match predicate {
BoundPredicate::And(children) => {
let ret = [visit_bound_predicate(visitor, children[0]),visit_bound_predicate(visitor, children[1])];
visitor.visit_and(ret)
},
...
}
}
pub struct ManifestEvaluator {}
impl BoundPredicateVisitor for ManifestEvaluator {}
impl ManifestEvaluator {
pub fn eval(&mut self, predicate: &BoundPredicate) -> bool {
visit_bound_predicate(self, predicate)?
}
} |
…e inclusive projection and manifest evaluator into own files
Hi @liurenjie1024 and @marvinlanhenke. I've updated this PR to align with the direction suggested above - thanks a lot for all the feedback! This PR is clearly far too big to easily digest now though. I'll rebase it and resubmit it in pieces as separate PRs:
Unless you have any other suggestions. I'll happily close any of the raised PRs if they don't match the direction we agree to go with this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sdd
Thanks again for your work here. I really think the split into modules was worth it. I also like your suggestion about splitting this into different PRs to make it more digestable. I left you some notes about the current implementation - which would be nice to consider in the following PRs.
use crate::Result; | ||
use fnv::FnvHashSet; | ||
|
||
pub trait BoundPredicateEvaluator { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think, it'd be better to rename this to BoundPredicateVisitor
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Perhaps we can move the complete trait into /visitors/mod.rs? However, I am not sure about this - so I'd appreciate your thoughts on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
pub trait BoundPredicateEvaluator { | ||
fn visit(&mut self, node: &BoundPredicate) -> Result<bool> { | ||
match node { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
according to the design suggestion by @liurenjie1024 link this function should live outside the trait as fn visit_bound_predicate
. Other than that traversal logic looks fine.
partition_spec: PartitionSpecRef, | ||
} | ||
|
||
impl InclusiveProjection { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll guess this will be done in the new PR - InclusiveProjection
should impl pub trait BoundPredicateVisitor
} | ||
} | ||
|
||
pub(crate) fn project(&self, predicate: &BoundPredicate) -> crate::Result<Predicate> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: although it does align with the Python impl. I'd rename this to eval
to keep it consistent across all visitors
case_sensitive: bool, | ||
} | ||
|
||
impl ManifestEvaluatorFactory { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we can turn this into a proper Builder
similar to the python function. I'm thinking ManifestEvaluatorBuilder
as a derived TypedBuilder
would probably be enough. This way we can provide table_schema
, partition_spec
and the row_filter
and orchestrate the InclusiveProjection
when we call eval
on the ManifestEvaluator
itself? What do you think?
}) | ||
} | ||
|
||
pub(crate) fn evaluate(&self, manifest_file: &ManifestFile) -> crate::Result<bool> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: rename to eval
A PR to propose an initial design for a partition evaluator to reject files in the table scan's file planner that don't match the scan's filter. This makes use of the
Bind
functionality added in #231 and assumes that theTransform::project
method from #269 is in place.Still To Do After This PR:
ManifestEvalVisitor
when it hits unary / binary / set predicatesInclusiveProjection::visit_bound_predicate
Which issue does this PR hope to close?
ManifestEvaluator
to prune mainfest files. #253Rationale for this change
Add the ability to use the TableScan's filter in the file planner to filter out files whose partition values don't match the filter.
What changes are included in this PR?
Predicate
toTableScan
, to be used to filter the manifests (and eventually data files) used by the scancase_sensitive
flag toTableScan
to allow case-sensitivity of filtering to be selectablePartitionEvaluator
: a new struct that handles evaluating whether manifests should be filtered out of a scanPartitionEvaluator
s, using its entries to filter out manifest entries that don't pass the scan's filter.Predicate
withAlwaysTrue
andAlwaysFalse
to complement the equivalents inBoundPredicate
ManifestEvalVisitor
to apply a partition filter predicate to aManifestFile
's partition field summaryInclusiveProjection
, similar to the python implementation, which projects aBoundPredicate
into aPredicate
that applies to partitions.field_id
accessor toBoundPredicate
Expressions.Are these changes tested?
Not yet. Unit and integration tests to follow with the implementation, once the design has been approved.
Are there any user-facing changes?
Not yet