
Avoid evaluating filters when they can be discarded purely from statistics #15425


Open
adriangb opened this issue Mar 25, 2025 · 3 comments
Labels
enhancement New feature or request


@adriangb
Contributor

adriangb commented Mar 25, 2025

Is your feature request related to a problem or challenge?

Currently stats filter pruning (both at the row group and page level) has one of two outcomes per container:

  1. This container cannot possibly match the filter (discard it).
  2. This container may match the filter, but which rows to include or exclude must be confirmed by evaluating the filter against each row of the data.

There is a big optimization opportunity here: if we know from statistics that every row in the container matches the filter, we don't need to evaluate the filter at all.

Consider a column `name` with values `["Adrian", "Adrian", "Adrian"]`. The min/max stats are `"Adrian"`/`"Adrian"`. A query with the filter `name = 'Adrian'` should never need to read the column data to know that all rows match the filter.

Another relevant case is a `ts` column with values `["2025-01-01T00:00:00Z", ..., "2025-01-01T00:01:32Z"]`. The values need not be sorted or ordered, but let's say that the min/max stats are `"2025-01-01T00:00:00Z"`/`"2025-01-01T00:01:32Z"`. For a filter `ts > '2024-12-31T00:00:00Z'` there is no need to evaluate the filter on every row: the stats alone prove that every row matches.
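The decision described above can be sketched as a small three-way outcome derived purely from min/max statistics. This is a conceptual illustration, not DataFusion's actual pruning API; the enum and function names are hypothetical:

```rust
// Sketch: classify a container from its min/max statistics alone.
// (Hypothetical names; DataFusion's real pruning machinery differs.)

#[derive(Debug, PartialEq)]
enum PruneOutcome {
    /// Every row matches: the filter need not be evaluated at all.
    AllMatch,
    /// No row can match: discard the container.
    NoneMatch,
    /// Some rows may match: evaluate the filter row-by-row.
    MayMatch,
}

/// Outcome of `col > value` given the container's min/max for `col`.
/// (Lexicographic comparison works for RFC 3339 timestamps.)
fn prune_gt(min: &str, max: &str, value: &str) -> PruneOutcome {
    if min > value {
        PruneOutcome::AllMatch // even the smallest stored value passes
    } else if max <= value {
        PruneOutcome::NoneMatch // even the largest stored value fails
    } else {
        PruneOutcome::MayMatch
    }
}

/// Outcome of `col = value` given the container's min/max for `col`.
fn prune_eq(min: &str, max: &str, value: &str) -> PruneOutcome {
    if min == value && max == value {
        PruneOutcome::AllMatch // constant column equal to the literal
    } else if value < min || value > max {
        PruneOutcome::NoneMatch
    } else {
        PruneOutcome::MayMatch
    }
}

fn main() {
    // name = 'Adrian' with min = max = "Adrian": all rows match.
    assert_eq!(prune_eq("Adrian", "Adrian", "Adrian"), PruneOutcome::AllMatch);
    // ts > '2024-12-31T00:00:00Z' with min "2025-01-01T00:00:00Z": all rows match.
    assert_eq!(
        prune_gt("2025-01-01T00:00:00Z", "2025-01-01T00:01:32Z", "2024-12-31T00:00:00Z"),
        PruneOutcome::AllMatch
    );
}
```

Today's pruning effectively collapses `AllMatch` into `MayMatch`; the proposal is to surface the `AllMatch` case so consumers can skip filter evaluation entirely.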

We could incorporate this change, but it would require some refactoring of https://github.com/apache/datafusion/blob/main/datafusion/physical-optimizer/src/pruning.rs and consumers.

@adriangb adriangb added the enhancement New feature or request label Mar 25, 2025
@matthewmturner
Contributor

I think this is similar to something I recently asked on Discord - except I had in mind using only the metadata stats for queries like `SELECT MAX(timestamp) FROM quotes`.

This was my full comment / question

"I'm doing some data exploration on a table in DataFusion where I'm running the following: `SELECT MAX(timestamp) FROM quotes`. The quotes table is about 100GB of data. When I run `EXPLAIN ANALYZE` on this plan I see from the ParquetExec 6B+ output rows and 30GB+ of bytes scanned. Given that I'm only getting the MAX for the column, shouldn't I be able to get this by doing much less work, looking only at the row group metadata stats and not scanning any data? That would give me a huge performance improvement (the metadata load time is < 1% of the total time scanning)."

@adriangb
Contributor Author

I think that has more to do with the `datafusion.execution.collect_statistics` setting and use of those statistics.

@adriangb
Contributor Author

adriangb commented Mar 25, 2025

Well, if you did set `datafusion.execution.collect_statistics`, I think those stats would be used to calculate the max before ever reaching the scanning phase, via some rewrites. But I think you're right: even if you didn't collect statistics upfront, if that expression could be pushed down into each individual file scan, then it could be optimized. Maybe related to #14993?
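The idea of answering `MAX(col)` from per-row-group statistics alone can be sketched as follows. This is a hypothetical illustration (the function name and `Option<i64>` stand-in for Parquet footer statistics are assumptions, not DataFusion's actual code); it also shows why the optimization is only sound when every row group carries an exact max statistic:

```rust
// Sketch: compute MAX(col) from row-group metadata only, without
// scanning any data pages. `row_group_maxes` stands in for the max
// statistic of each row group, read from the Parquet footer; `None`
// means that row group has no usable statistic.

fn max_from_stats(row_group_maxes: &[Option<i64>]) -> Option<i64> {
    // If any row group lacks a max statistic, we cannot answer from
    // metadata alone and must fall back to scanning the data.
    row_group_maxes
        .iter()
        .copied()
        .collect::<Option<Vec<i64>>>()? // None if any statistic is missing
        .into_iter()
        .max()
}

fn main() {
    // All row groups have stats: answer directly from metadata.
    assert_eq!(max_from_stats(&[Some(3), Some(7), Some(5)]), Some(7));
    // A row group without stats forces a fallback to scanning.
    assert_eq!(max_from_stats(&[Some(3), None]), None);
}
```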
