Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

[WIP] Add ManifestEvaluator to allow filtering of files in a table scan (Issue #152) #241

Closed
wants to merge 17 commits into from

Conversation

sdd
Copy link
Contributor

@sdd sdd commented Mar 9, 2024

A PR to propose an initial design for a partition evaluator to reject files in the table scan's file planner that don't match the scan's filter. This makes use of the Bind functionality added in #231 and assumes that the Transform::project method from #269 is in place.

Still To Do After This PR:

  • Replace placeholder code in ManifestEvalVisitor when it hits unary / binary / set predicates
  • Caching in InclusiveProjection::visit_bound_predicate

Which issue does this PR hope to close?

Rationale for this change

Add the ability to use the TableScan's filter in the file planner to filter out files whose partition values don't match the filter.

What changes are included in this PR?

  • Add an optional Predicate to TableScan, to be used to filter the manifests (and eventually data files) used by the scan
  • Add a case_sensitive flag to TableScan to allow case-sensitivity of filtering to be selectable
  • Add PartitionEvaluator: a new struct that handles evaluating whether manifests should be filtered out of a scan
  • Creates a cache of PartitionEvaluators, using its entries to filter out manifest entries that don't pass the scan's filter.
  • Extends Predicate with AlwaysTrue and AlwaysFalse to complement the equivalents in BoundPredicate
  • Implements ManifestEvalVisitor to apply a partition filter predicate to a ManifestFile's partition field summary
  • Add InclusiveProjection, similar to the python implementation, which projects a BoundPredicate into a Predicate that applies to partitions.
  • Add a field_id accessor to BoundPredicate Expressions.

Are these changes tested?

Not yet. Unit and integration tests to follow with the implementation, once the design has been approved.

Are there any user-facing changes?

Not yet

@sdd
Copy link
Contributor Author

sdd commented Mar 9, 2024

@liurenjie1024 PTAL: Your thoughts on this very much appreciated. This also depends on your predicate binding PR, #231, getting merged.

@Fokko @Xuanwo @ZENOTME @JanKaul would also appreciate any of your thoughts also.

@liurenjie1024
Copy link
Contributor

Hi, @sdd Thanks for this pr. The filtering process in fact consists of two steps:

  1. Filter manifest files in manifest list. This step is relative easy to do, and has no dependency, see python implementation.

  2. Implement partition pruning. This step is relative complicated, since it has several dependencies, such as inclusive projections, data file evaluator, you can see python implementation.

My suggestion is to finish ManifestEvalVisitor first since it's relative easy and has no dependency.

@sdd
Copy link
Contributor Author

sdd commented Mar 11, 2024

Sounds good @liurenjie1024 - will do.

@sdd sdd force-pushed the sdd/issue-152 branch 2 times, most recently from a066242 to b89648d Compare March 17, 2024 13:29
@@ -599,6 +635,14 @@ impl Display for BoundPredicate {
}
}

pub(crate) trait PredicateVisitor<T> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think these are needed anymore.

@sdd
Copy link
Contributor Author

sdd commented Mar 17, 2024

@liurenjie1024 @Xuanwo @ZENOTME: I've updated this PR, details in the updated description at the top. It builds on the Bind implementation that recently got merged and is all set to make use of the Transform::project implementation in #269.

There are a few issues that I could do with some help on how best to get them resolved, mentioned in the "Unsolved Problems" section of the description at the top if this PR.

@sdd sdd changed the title [WIP] Add PartitionEvaluator to allow filtering of files in a table scan (Issue #152) [WIP] Add ManifestEvaluator to allow filtering of files in a table scan (Issue #152) Mar 17, 2024
AlwaysTrue => true,
BoundPredicate::AlwaysFalse => false,
BoundPredicate::And(expr) => {
self.visit(expr.inputs()[0], partitions) && self.visit(expr.inputs()[1], partitions)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can have something like fn eval(&self, Vec<Literal>)->bool for the Predicate.

@sdd
Copy link
Contributor Author

sdd commented Mar 23, 2024

@liurenjie1024 are you able to take a look at this again please when you have time? I've resolved the outstanding issues with binding and added accessors to address the missing functionality.

Copy link
Contributor

@marvinlanhenke marvinlanhenke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sdd
Thanks again for your work. I really like the overall structure and the work you've already done here! However, I left some comments about some design decisions, where I think we can improve - or at least I'm unsure if there isn't a better way.

partition_filter: Predicate,
case_sensitive: bool,
) -> crate::Result<Self> {
let partition_filter = partition_filter.bind(partition_schema.clone(), case_sensitive)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have rewrite_not already in place to be applied to partition_filter?

.unwrap();
/// Evaluates manifest files to see if their partition values comply with a filter predicate
pub struct PartitionEvaluator {
manifest_eval_visitor: ManifestEvalVisitor,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should nest ManifestEvalVisitor inside PartitionEvaluator since both structs have separate responsibilities and could exists independently (I shouldn't need a ManifestEvalVisitor to do partition pruning).

Also I think the implementation of PartitionEvaluator is missing [reference]?
It should visit each DataFiles struct and evaluate the predicate.

.or_insert_with_key(|key| self.create_partition_evaluator(key, filter));

// reject any manifest files whose partition values don't match the filter.
if !partition_evaluator.filter_manifest_file(entry) {
Copy link
Contributor

@marvinlanhenke marvinlanhenke Apr 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should apply ManifestEvaluator independently from PartitionEvaluator (see comment down below in struct PartitionEvaluator).

Also, I'm not sure if we can simply 'skip' the file with continue. In the python impl the Partition- and MetricsEvaluator are applied on each DataFile on multiple threads - so I believe we have to collect all relevant manifest files before and then apply the Partition- and MetricsEvaluator on the complete (pre-filtered) collection, in order to split the work into multiple threads?

@marvinlanhenke
Copy link
Contributor

@liurenjie1024 @ZENOTME @sdd @Xuanwo
I'd really appreciate your thoughts on this:

I took a closer look at the work @sdd has already done - and I think in order to proceed it would make sense to split the ManifestEvaluator, InclusiveProjection and PartitionEvaluator into separate modules.

I am thinking about putting each visitor into its own file within /iceberg/src/expr (perhaps even another subfolder /visitors) - for example: /iceberg/src/expr/manifest_evaluator.rs.

This way, we could create individual issues on each implementation - and work better in parallel.

Perhaps, it also makes sense to provide a trait to make sure each visitor adheres to the same interface
(although I'm not sure this is neccessary right now...)

// Pseudo-Example
pub trait BooleanVisitor:
  fn eval() -> boolean

@sdd
Copy link
Contributor Author

sdd commented Apr 2, 2024

Thanks for the review @marvinlanhenke ! I'll address the comments today or tomorrow morning 😁

@sdd
Copy link
Contributor Author

sdd commented Apr 2, 2024

Also I've got some uncommitted changes around having a boolean evaluator trait, similar to your suggestion. It has default implementations for and / or / not / always true / always false, but requires the others to be implemented by any struct implementing the trait.

@liurenjie1024
Copy link
Contributor

@liurenjie1024 @ZENOTME @sdd @Xuanwo I'd really appreciate your thoughts on this:

I took a closer look at the work @sdd has already done - and I think in order to proceed it would make sense to split the ManifestEvaluator, InclusiveProjection and PartitionEvaluator into separate modules.

I am thinking about putting each visitor into its own file within /iceberg/src/expr (perhaps even another subfolder /visitors) - for example: /iceberg/src/expr/manifest_evaluator.rs.

This way, we could create individual issues on each implementation - and work better in parallel.

Perhaps, it also makes sense to provide a trait to make sure each visitor adheres to the same interface (although I'm not sure this is neccessary right now...)

// Pseudo-Example
pub trait BooleanVisitor:
  fn eval() -> boolean

+1 for splitting these into split modules. Instead of a boolean visitor, I'm thinking about a post order predicate visitor:

pub trait BoundPredicateVisitor {
    
}

which is somehow motivated by @viirya 's pr, but move the travsering flow out of trait body.

@marvinlanhenke
Copy link
Contributor

marvinlanhenke commented Apr 2, 2024

which is somehow motivated by @viirya 's pr, but move the travsering flow out of trait body.

@liurenjie1024
...so the traversal flow could then be implemented on e.g. the ManifestEvaluator itself.
For example, eval() could call the corresponding 'visit_xx' on the visitor that implements BoundPredicateVisitor. Is this what you mean?

@liurenjie1024
Copy link
Contributor

which is somehow motivated by @viirya 's pr, but move the travsering flow out of trait body.

@liurenjie1024 ...so the traversal flow could then be implemented on e.g. the ManifestEvaluator itself. For example, eval() could call the corresponding 'visit_xx' on the visitor that implements BoundPredicateVisitor. Is this what you mean?

I was thinking about following following structure:

pub trait BoundPredicateVisitor {
  type T;

  fn visit_and(&mut self, values: [Self::T; 2]) -> Result<Self::T>;
  fn visit_or(&mut self, values: [Self::T; 2]) -> Result<Self::T>;

  ...
}

pub fn visit_bound_predicate(visitor: &mut V, predicate: &BoundPredicate) -> Result<V::T> {
   match predicate {
    BoundPredicate::And(children) => {
      let ret = [visit_bound_predicate(visitor, children[0]),visit_bound_predicate(visitor, children[1])];
      visitor.visit_and(ret)
  },
  ...
 }
}

pub struct ManifestEvaluator {}

impl BoundPredicateVisitor  for ManifestEvaluator {}

impl ManifestEvaluator  {

  pub fn eval(&mut self, predicate: &BoundPredicate) -> bool {
    visit_bound_predicate(self, predicate)?
  } 
}

@sdd
Copy link
Contributor Author

sdd commented Apr 3, 2024

Hi @liurenjie1024 and @marvinlanhenke. I've updated this PR to align with the direction suggested above - thanks a lot for all the feedback!

This PR is clearly far too big to easily digest now though. I'll rebase it and resubmit it in pieces as separate PRs:

  • Accessors
  • BoundPredicateVisitor
  • InclusiveProjection
  • ManifestEvaluator itself
  • Misc changes such as adding AlwaysTrue / AlwaysFalse and any new getters
  • Changes to scan.rs that bring it all together

Unless you have any other suggestions. I'll happily close any of the raised PRs if they don't match the direction we agree to go with this.

Copy link
Contributor

@marvinlanhenke marvinlanhenke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sdd
Thanks again for your work here. I really think the split into modules was worth it. I also like your suggestion about splitting this into different PRs to make it more digestable. I left you some notes about the current implementation - which would be nice to consider in the following PRs.

use crate::Result;
use fnv::FnvHashSet;

pub trait BoundPredicateEvaluator {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, it'd be better to rename this to BoundPredicateVisitor

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Perhaps we can move the complete trait into /visitors/mod.rs? However, I am not sure about this - so I'd appreciate your thoughts on this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


pub trait BoundPredicateEvaluator {
fn visit(&mut self, node: &BoundPredicate) -> Result<bool> {
match node {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

according to the design suggestion by @liurenjie1024 link this function should live outside the trait as fn visit_bound_predicate. Other than that traversal logic looks fine.

partition_spec: PartitionSpecRef,
}

impl InclusiveProjection {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll guess this will be done in the new PR - InclusiveProjection should impl pub trait BoundPredicateVisitor

}
}

pub(crate) fn project(&self, predicate: &BoundPredicate) -> crate::Result<Predicate> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: although it does align with the Python impl. I'd rename this to eval to keep it consistent across all visitors

case_sensitive: bool,
}

impl ManifestEvaluatorFactory {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we can turn this into a proper Builder similar to the python function. I'm thinking ManifestEvaluatorBuilder as a derived TypedBuilder would probably be enough. This way we can provide table_schema, partition_spec and the row_filter and orchestrate the InclusiveProjection when we call eval on the ManifestEvaluator itself? What do you think?

})
}

pub(crate) fn evaluate(&self, manifest_file: &ManifestFile) -> crate::Result<bool> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: rename to eval

@sdd
Copy link
Contributor Author

sdd commented Apr 9, 2024

Closing this PR as it has been split into #317, #320, #321, #322, and #323

@sdd sdd closed this Apr 9, 2024
@sdd sdd deleted the sdd/issue-152 branch August 2, 2024 06:12
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants