Add example of using PruningPredicate to datafusion-examples #9183

Merged
merged 4 commits into apache:main from alamb/pruning_example
Feb 10, 2024

Conversation

alamb
Contributor

@alamb alamb commented Feb 9, 2024

Which issue does this PR close?

Part of #7013

Related to #7869 and #9171

Rationale for this change

  1. We rely heavily on PruningPredicate in InfluxDB to prune data based on catalog information, so I want an easy way to point our engineers at it and help them understand how it works.
  2. We also had some non-trivial confusion internally about how (or whether) pruning predicates handle unknown column values, which I wanted to document.

What changes are included in this PR?

  1. Add a `pruning.rs` example to datafusion-examples with an annotated guide to using `PruningPredicate`
  2. Add a link to the example in the `PruningPredicate` API docs

Are these changes tested?

Yes, as part of CI

Are there any user-facing changes?

A new example, no code changes

@github-actions github-actions bot added the core Core DataFusion crate label Feb 9, 2024
// File 2: `x = 5 AND y = 10` can never evaluate to true because y
// has only the value of 7. Thus this file can be skipped.
false,
// File 3: `x = 5 AND y = 10` can never evaluate to true because x
Contributor Author

FYI @appletreeisyellow here is an actual example showing that the pruning predicate does the right thing with unknown column values

Contributor

File 3 example makes sense to me 👍

Contributor

I'm curious what the result will be for a file 4 like:

File 4: x has values between 4 and 6
nothing is known about the value of y

With the same predicate x = 5 AND y = 10, my understanding is that it will evaluate to true.

x = 5 AND y = 10
--> true AND null
--> null

Since y is unknown, there is a possibility that y is 10 in this file / partition / row group of data. Thus this file cannot be skipped and the result is true.

Contributor Author

> With the same predicate x = 5 AND y = 10, my understanding is that it will evaluate to true.

Yes, this is my understanding too (that the PruningPredicate will return true for this container)

> Since y is unknown, there is a possibility that y is 10 in this file / partition / row group of data. Thus this file cannot be skipped and the result is true.

Yes, that is my understanding as well
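
The reasoning in this thread can be sketched with SQL-style three-valued logic, using `Option<bool>` where `None` stands for "unknown". This is only an illustration of the idea, not DataFusion's implementation; the names `and3` and `must_read` are hypothetical and do not exist in the `PruningPredicate` API.

```rust
/// Three-valued (Kleene) AND. `None` means "unknown".
/// Hypothetical helper for illustration only.
fn and3(a: Option<bool>, b: Option<bool>) -> Option<bool> {
    match (a, b) {
        // A definite `false` on either side decides the result.
        (Some(false), _) | (_, Some(false)) => Some(false),
        // Both sides are definitely true.
        (Some(true), Some(true)) => Some(true),
        // Otherwise at least one side is unknown and nothing rules it out.
        _ => None,
    }
}

/// A container (file / partition / row group) can be skipped only when
/// the predicate is provably false. Unknown is treated like "may match".
fn must_read(predicate: Option<bool>) -> bool {
    predicate != Some(false)
}
```

For the proposed File 4, `x = 5` may hold (x ranges over 4..=6) and `y = 10` is unknown, so `and3(Some(true), None)` is `None` and `must_read` returns `true`: the file is kept. If either side were provably false, for example `and3(None, Some(false))`, the file could be skipped.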

Contributor

@comphead comphead left a comment

Nice @alamb, I love reviewing docs like this as they give me a better understanding.
There is likely a typo.

@alamb
Contributor Author

alamb commented Feb 9, 2024

> Nice @alamb, I love reviewing docs like this as they give me a better understanding. There is likely a typo.

Nice eyes -- thanks @comphead

BTW if you like reading background material, just wait for #9184 :)

Contributor

@appletreeisyellow appletreeisyellow left a comment

Thank you for adding examples @alamb. Super helpful! I left a question about a possible new example, plus a suggestion.

Comment on lines +123 to +125
// Note, returning null means the value isn't known, NOT
// that we know the entire column is null.
(None, None),
Contributor

Nice!

Contributor Author

That probably looks familiar :)
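
To make the "null means unknown, not all-null" point from the snippet above concrete, here is a minimal sketch of how per-container min/max statistics could feed an equality check. It assumes `i64` columns and uses a hypothetical `MinMax` struct and `might_equal` function; neither is part of DataFusion's `PruningStatistics` or `PruningPredicate` API.

```rust
/// Hypothetical min/max statistics for one column of one container.
/// `None` means the bound is not known, NOT that the column is all null.
#[derive(Clone, Copy)]
struct MinMax {
    min: Option<i64>,
    max: Option<i64>,
}

/// Could `column = value` be true for some row in the container?
/// - `Some(false)`: provably not, the value lies outside the known range
/// - `Some(true)`: the value lies inside the known range, so it may match
/// - `None`: the statistics are insufficient to decide
fn might_equal(stats: MinMax, value: i64) -> Option<bool> {
    match (stats.min, stats.max) {
        // Both bounds known: a plain range check.
        (Some(min), Some(max)) => Some(min <= value && value <= max),
        // One known bound is enough to rule the value out.
        (Some(min), None) if value < min => Some(false),
        (None, Some(max)) if value > max => Some(false),
        // Anything else stays unknown.
        _ => None,
    }
}
```

With these definitions, a column that holds only the value 7 (`min` and `max` both `Some(7)`) gives `Some(false)` for `y = 10`, matching the File 2 comment, while `(None, None)` gives `None`, so the container cannot be ruled out on that column alone.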

Co-authored-by: Chunchun Ye <14298407+appletreeisyellow@users.noreply.github.com>
Contributor

@comphead comphead left a comment

lgtm thanks @alamb

@comphead comphead merged commit a48e271 into apache:main Feb 10, 2024
@alamb alamb deleted the alamb/pruning_example branch February 10, 2024 21:29