[WIP] Add `ManifestEvaluator` to allow filtering of files in a table scan (Issue #152) #241

sdd · 2024-03-09T14:09:50Z

A PR to propose an initial design for a partition evaluator to reject files in the table scan's file planner that don't match the scan's filter. This makes use of the Bind functionality added in #231 and assumes that the Transform::project method from #269 is in place.

Still To Do After This PR:

Replace placeholder code in ManifestEvalVisitor when it hits unary / binary / set predicates
Caching in InclusiveProjection::visit_bound_predicate

Which issue does this PR hope to close?

Rationale for this change

Add the ability to use the TableScan's filter in the file planner to filter out files whose partition values don't match the filter.

What changes are included in this PR?

Add an optional Predicate to TableScan, to be used to filter the manifests (and eventually data files) used by the scan
Add a case_sensitive flag to TableScan to allow case-sensitivity of filtering to be selectable
Add PartitionEvaluator: a new struct that handles evaluating whether manifests should be filtered out of a scan
Creates a cache of PartitionEvaluators, using its entries to filter out manifest entries that don't pass the scan's filter.
Extends Predicate with AlwaysTrue and AlwaysFalse to complement the equivalents in BoundPredicate
Implements ManifestEvalVisitor to apply a partition filter predicate to a ManifestFile's partition field summary
Add InclusiveProjection, similar to the python implementation, which projects a BoundPredicate into a Predicate that applies to partitions.
Add a field_id accessor to BoundPredicate Expressions.

Are these changes tested?

Not yet. Unit and integration tests to follow with the implementation, once the design has been approved.

Are there any user-facing changes?

Not yet

sdd · 2024-03-09T14:13:17Z

@liurenjie1024 PTAL: Your thoughts on this very much appreciated. This also depends on your predicate binding PR, #231, getting merged.

@Fokko @Xuanwo @ZENOTME @JanKaul would also appreciate any of your thoughts also.

liurenjie1024 · 2024-03-11T02:11:40Z

Hi, @sdd Thanks for this pr. The filtering process in fact consists of two steps:

Filter manifest files in manifest list. This step is relative easy to do, and has no dependency, see python implementation.
Implement partition pruning. This step is relative complicated, since it has several dependencies, such as inclusive projections, data file evaluator, you can see python implementation.

My suggestion is to finish ManifestEvalVisitor first since it's relative easy and has no dependency.

sdd · 2024-03-11T07:02:24Z

Sounds good @liurenjie1024 - will do.

crates/iceberg/src/scan.rs

sdd · 2024-03-17T13:35:39Z

crates/iceberg/src/expr/predicate.rs

@@ -599,6 +635,14 @@ impl Display for BoundPredicate {
    }
 }

+pub(crate) trait PredicateVisitor<T> {


I don't think these are needed anymore.

sdd · 2024-03-17T14:01:16Z

@liurenjie1024 @Xuanwo @ZENOTME: I've updated this PR, details in the updated description at the top. It builds on the Bind implementation that recently got merged and is all set to make use of the Transform::project implementation in #269.

There are a few issues that I could do with some help on how best to get them resolved, mentioned in the "Unsolved Problems" section of the description at the top if this PR.

crates/iceberg/src/scan.rs

ZENOTME · 2024-03-18T15:26:06Z

crates/iceberg/src/scan.rs

+            AlwaysTrue => true,
+            BoundPredicate::AlwaysFalse => false,
+            BoundPredicate::And(expr) => {
+                self.visit(expr.inputs()[0], partitions) && self.visit(expr.inputs()[1], partitions)


Maybe we can have something like fn eval(&self, Vec<Literal>)->bool for the Predicate.

Issue: 152

…ed in build()

sdd · 2024-03-23T23:30:38Z

@liurenjie1024 are you able to take a look at this again please when you have time? I've resolved the outstanding issues with binding and added accessors to address the missing functionality.

… rewrite_not call

marvinlanhenke

@sdd
Thanks again for your work. I really like the overall structure and the work you've already done here! However, I left some comments about some design decisions, where I think we can improve - or at least I'm unsure if there isn't a better way.

marvinlanhenke · 2024-04-02T04:23:31Z

crates/iceberg/src/scan.rs

+        partition_filter: Predicate,
+        case_sensitive: bool,
+    ) -> crate::Result<Self> {
+        let partition_filter = partition_filter.bind(partition_schema.clone(), case_sensitive)?;


I think we have rewrite_not already in place to be applied to partition_filter?

crates/iceberg/src/scan.rs

marvinlanhenke · 2024-04-02T06:07:41Z

crates/iceberg/src/scan.rs

-            .unwrap();
+/// Evaluates manifest files to see if their partition values comply with a filter predicate
+pub struct PartitionEvaluator {
+    manifest_eval_visitor: ManifestEvalVisitor,


I don't think we should nest ManifestEvalVisitor inside PartitionEvaluator since both structs have separate responsibilities and could exists independently (I shouldn't need a ManifestEvalVisitor to do partition pruning).

Also I think the implementation of PartitionEvaluator is missing [reference]?
It should visit each DataFiles struct and evaluate the predicate.

marvinlanhenke · 2024-04-02T06:14:18Z

crates/iceberg/src/scan.rs

+                            .or_insert_with_key(|key| self.create_partition_evaluator(key, filter));
+
+                    // reject any manifest files whose partition values don't match the filter.
+                    if !partition_evaluator.filter_manifest_file(entry) {


I think we should apply ManifestEvaluator independently from PartitionEvaluator (see comment down below in struct PartitionEvaluator).

Also, I'm not sure if we can simply 'skip' the file with continue. In the python impl the Partition- and MetricsEvaluator are applied on each DataFile on multiple threads - so I believe we have to collect all relevant manifest files before and then apply the Partition- and MetricsEvaluator on the complete (pre-filtered) collection, in order to split the work into multiple threads?

marvinlanhenke · 2024-04-02T06:25:28Z

@liurenjie1024 @ZENOTME @sdd @Xuanwo
I'd really appreciate your thoughts on this:

I took a closer look at the work @sdd has already done - and I think in order to proceed it would make sense to split the ManifestEvaluator, InclusiveProjection and PartitionEvaluator into separate modules.

I am thinking about putting each visitor into its own file within /iceberg/src/expr (perhaps even another subfolder /visitors) - for example: /iceberg/src/expr/manifest_evaluator.rs.

This way, we could create individual issues on each implementation - and work better in parallel.

Perhaps, it also makes sense to provide a trait to make sure each visitor adheres to the same interface
(although I'm not sure this is neccessary right now...)

// Pseudo-Example
pub trait BooleanVisitor:
  fn eval() -> boolean

sdd · 2024-04-02T07:03:20Z

Thanks for the review @marvinlanhenke ! I'll address the comments today or tomorrow morning 😁

sdd · 2024-04-02T07:06:08Z

Also I've got some uncommitted changes around having a boolean evaluator trait, similar to your suggestion. It has default implementations for and / or / not / always true / always false, but requires the others to be implemented by any struct implementing the trait.

liurenjie1024 · 2024-04-02T12:44:10Z

@liurenjie1024 @ZENOTME @sdd @Xuanwo I'd really appreciate your thoughts on this:

I took a closer look at the work @sdd has already done - and I think in order to proceed it would make sense to split the ManifestEvaluator, InclusiveProjection and PartitionEvaluator into separate modules.

I am thinking about putting each visitor into its own file within /iceberg/src/expr (perhaps even another subfolder /visitors) - for example: /iceberg/src/expr/manifest_evaluator.rs.

This way, we could create individual issues on each implementation - and work better in parallel.

Perhaps, it also makes sense to provide a trait to make sure each visitor adheres to the same interface (although I'm not sure this is neccessary right now...)
// Pseudo-Example
pub trait BooleanVisitor:
  fn eval() -> boolean

+1 for splitting these into split modules. Instead of a boolean visitor, I'm thinking about a post order predicate visitor:

pub trait BoundPredicateVisitor {
    
}

which is somehow motivated by @viirya 's pr, but move the travsering flow out of trait body.

marvinlanhenke · 2024-04-02T14:43:39Z

which is somehow motivated by @viirya 's pr, but move the travsering flow out of trait body.

@liurenjie1024
...so the traversal flow could then be implemented on e.g. the ManifestEvaluator itself.
For example, eval() could call the corresponding 'visit_xx' on the visitor that implements BoundPredicateVisitor. Is this what you mean?

liurenjie1024 · 2024-04-03T01:20:24Z

which is somehow motivated by @viirya 's pr, but move the travsering flow out of trait body.

@liurenjie1024 ...so the traversal flow could then be implemented on e.g. the ManifestEvaluator itself. For example, eval() could call the corresponding 'visit_xx' on the visitor that implements BoundPredicateVisitor. Is this what you mean?

I was thinking about following following structure:

pub trait BoundPredicateVisitor {
  type T;

  fn visit_and(&mut self, values: [Self::T; 2]) -> Result<Self::T>;
  fn visit_or(&mut self, values: [Self::T; 2]) -> Result<Self::T>;

  ...
}

pub fn visit_bound_predicate(visitor: &mut V, predicate: &BoundPredicate) -> Result<V::T> {
   match predicate {
    BoundPredicate::And(children) => {
      let ret = [visit_bound_predicate(visitor, children[0]),visit_bound_predicate(visitor, children[1])];
      visitor.visit_and(ret)
  },
  ...
 }
}

pub struct ManifestEvaluator {}

impl BoundPredicateVisitor  for ManifestEvaluator {}

impl ManifestEvaluator  {

  pub fn eval(&mut self, predicate: &BoundPredicate) -> bool {
    visit_bound_predicate(self, predicate)?
  } 
}

…e inclusive projection and manifest evaluator into own files

sdd · 2024-04-03T19:02:16Z

Hi @liurenjie1024 and @marvinlanhenke. I've updated this PR to align with the direction suggested above - thanks a lot for all the feedback!

This PR is clearly far too big to easily digest now though. I'll rebase it and resubmit it in pieces as separate PRs:

Accessors
BoundPredicateVisitor
InclusiveProjection
ManifestEvaluator itself
Misc changes such as adding AlwaysTrue / AlwaysFalse and any new getters
Changes to scan.rs that bring it all together

Unless you have any other suggestions. I'll happily close any of the raised PRs if they don't match the direction we agree to go with this.

marvinlanhenke

@sdd
Thanks again for your work here. I really think the split into modules was worth it. I also like your suggestion about splitting this into different PRs to make it more digestable. I left you some notes about the current implementation - which would be nice to consider in the following PRs.

marvinlanhenke · 2024-04-04T03:43:48Z

crates/iceberg/src/expr/visitors/bound_predicate_evaluator.rs

+use crate::Result;
+use fnv::FnvHashSet;
+
+pub trait BoundPredicateEvaluator {


I think, it'd be better to rename this to BoundPredicateVisitor

nit: Perhaps we can move the complete trait into /visitors/mod.rs? However, I am not sure about this - so I'd appreciate your thoughts on this.

cc @liurenjie1024

marvinlanhenke · 2024-04-04T03:45:59Z

crates/iceberg/src/expr/visitors/bound_predicate_evaluator.rs

+
+pub trait BoundPredicateEvaluator {
+    fn visit(&mut self, node: &BoundPredicate) -> Result<bool> {
+        match node {


according to the design suggestion by @liurenjie1024 link this function should live outside the trait as fn visit_bound_predicate. Other than that traversal logic looks fine.

marvinlanhenke · 2024-04-04T03:49:25Z

crates/iceberg/src/expr/visitors/inclusive_projection.rs

+    partition_spec: PartitionSpecRef,
+}
+
+impl InclusiveProjection {


I'll guess this will be done in the new PR - InclusiveProjection should impl pub trait BoundPredicateVisitor

marvinlanhenke · 2024-04-04T03:52:24Z

crates/iceberg/src/expr/visitors/inclusive_projection.rs

+        }
+    }
+
+    pub(crate) fn project(&self, predicate: &BoundPredicate) -> crate::Result<Predicate> {


nit: although it does align with the Python impl. I'd rename this to eval to keep it consistent across all visitors

marvinlanhenke · 2024-04-04T04:01:56Z

crates/iceberg/src/expr/visitors/manifest_evaluator.rs

+    case_sensitive: bool,
+}
+
+impl ManifestEvaluatorFactory {


Perhaps we can turn this into a proper Builder similar to the python function. I'm thinking ManifestEvaluatorBuilder as a derived TypedBuilder would probably be enough. This way we can provide table_schema, partition_spec and the row_filter and orchestrate the InclusiveProjection when we call eval on the ManifestEvaluator itself? What do you think?

marvinlanhenke · 2024-04-04T04:03:20Z

crates/iceberg/src/expr/visitors/manifest_evaluator.rs

+        })
+    }
+
+    pub(crate) fn evaluate(&self, manifest_file: &ManifestFile) -> crate::Result<bool> {


nit: rename to eval

sdd · 2024-04-09T06:48:14Z

Closing this PR as it has been split into #317, #320, #321, #322, and #323

Xuanwo reviewed Mar 11, 2024

View reviewed changes

crates/iceberg/src/scan.rs Outdated Show resolved Hide resolved

crates/iceberg/src/scan.rs Outdated Show resolved Hide resolved

waynexia mentioned this pull request Mar 13, 2024

Implement ManifestEvaluator to prune mainfest files. #253

Closed

sdd force-pushed the sdd/issue-152 branch 2 times, most recently from a066242 to b89648d Compare March 17, 2024 13:29

sdd commented Mar 17, 2024

View reviewed changes

sdd changed the title ~~[WIP] Add PartitionEvaluator to allow filtering of files in a table scan (Issue #152)~~ [WIP] Add ManifestEvaluator to allow filtering of files in a table scan (Issue #152) Mar 17, 2024

marvinlanhenke reviewed Mar 18, 2024

View reviewed changes

crates/iceberg/src/scan.rs Outdated Show resolved Hide resolved

ZENOTME reviewed Mar 18, 2024

View reviewed changes

sdd mentioned this pull request Mar 21, 2024

Modify Bind calls so that they don't consume self and instead return a new struct, leaving the original unmoved #290

Merged

sdd added 7 commits March 22, 2024 19:02

feat: add PartitionEvaluator

9839ef1

Issue: 152

feat(wip): add ManifestEvalVisitor and InclusiveProjection

9a0f80f

feat: evaluate ManifestFile partition values before loading its Manifest

91c2852

feat: avoid unneeded extra iter and await

53c6073

chore: fix some clippy lints

f270443

feat: implement field_id accessor for BoundReference Expressions

b8b6bf2

feat(wip): add accessors

380f4a0

sdd force-pushed the sdd/issue-152 branch from 6063459 to 380f4a0 Compare March 22, 2024 19:21

sdd added 4 commits March 22, 2024 19:29

fix: dont make predicate op pub(crate)"

47270d2

refactor: change TableScanBuilder case sensitivity setter

9156a56

chore: remove outdated comment and fix typo

22ac049

feat: redo accessors. Add field_id to accessor map to Schema, popualt…

7b1eff8

…ed in build()

feat: ensure visit/project are fallible, accessor returns option. Add…

56d935c

… rewrite_not call

marvinlanhenke reviewed Apr 2, 2024

View reviewed changes

sdd added 5 commits April 3, 2024 08:26

feat: add BoundPredicateEvaluator

7bf331d

feat: update accessors to use Arc and store their map as Arcs.

fbe09e9

fix: InclusiveProjection

35428c9

refactor: factor out PartitionEvaluator, use only ManifestEvaluator

c663a7d

refactor: add BoundPredicateEvaluator, use for ManifestEvaluator, mov…

ae8de9b

…e inclusive projection and manifest evaluator into own files

marvinlanhenke reviewed Apr 4, 2024

View reviewed changes

sdd closed this Apr 9, 2024

sdd mentioned this pull request Apr 16, 2024

add InclusiveProjection Visitor #335

Merged

sdd deleted the sdd/issue-152 branch August 2, 2024 06:12

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] Add `ManifestEvaluator` to allow filtering of files in a table scan (Issue #152) #241

[WIP] Add `ManifestEvaluator` to allow filtering of files in a table scan (Issue #152) #241

sdd commented Mar 9, 2024 •

edited

Loading

sdd commented Mar 9, 2024

liurenjie1024 commented Mar 11, 2024

sdd commented Mar 11, 2024

sdd Mar 17, 2024

sdd commented Mar 17, 2024

ZENOTME Mar 18, 2024

sdd commented Mar 23, 2024

marvinlanhenke left a comment

marvinlanhenke Apr 2, 2024

marvinlanhenke Apr 2, 2024

marvinlanhenke Apr 2, 2024 •

edited

Loading

marvinlanhenke commented Apr 2, 2024

sdd commented Apr 2, 2024

sdd commented Apr 2, 2024

liurenjie1024 commented Apr 2, 2024

marvinlanhenke commented Apr 2, 2024 •

edited

Loading

liurenjie1024 commented Apr 3, 2024

sdd commented Apr 3, 2024

marvinlanhenke left a comment

marvinlanhenke Apr 4, 2024

marvinlanhenke Apr 4, 2024

marvinlanhenke Apr 4, 2024

marvinlanhenke Apr 4, 2024

marvinlanhenke Apr 4, 2024

marvinlanhenke Apr 4, 2024

marvinlanhenke Apr 4, 2024

marvinlanhenke Apr 4, 2024

sdd commented Apr 9, 2024

[WIP] Add ManifestEvaluator to allow filtering of files in a table scan (Issue #152) #241

[WIP] Add ManifestEvaluator to allow filtering of files in a table scan (Issue #152) #241

Conversation

sdd commented Mar 9, 2024 • edited Loading

Still To Do After This PR:

Which issue does this PR hope to close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

sdd commented Mar 9, 2024

liurenjie1024 commented Mar 11, 2024

sdd commented Mar 11, 2024

Choose a reason for hiding this comment

sdd commented Mar 17, 2024

Choose a reason for hiding this comment

sdd commented Mar 23, 2024

marvinlanhenke left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

marvinlanhenke Apr 2, 2024 • edited Loading

Choose a reason for hiding this comment

marvinlanhenke commented Apr 2, 2024

sdd commented Apr 2, 2024

sdd commented Apr 2, 2024

liurenjie1024 commented Apr 2, 2024

marvinlanhenke commented Apr 2, 2024 • edited Loading

liurenjie1024 commented Apr 3, 2024

sdd commented Apr 3, 2024

marvinlanhenke left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sdd commented Apr 9, 2024

[WIP] Add `ManifestEvaluator` to allow filtering of files in a table scan (Issue #152) #241

[WIP] Add `ManifestEvaluator` to allow filtering of files in a table scan (Issue #152) #241

sdd commented Mar 9, 2024 •

edited

Loading

marvinlanhenke Apr 2, 2024 •

edited

Loading

marvinlanhenke commented Apr 2, 2024 •

edited

Loading