
Avoid evaluating filters when they can be discarded purely from statistics #15425


Open
adriangb opened this issue Mar 25, 2025 · 3 comments
Labels
enhancement New feature or request


@adriangb
Contributor

adriangb commented Mar 25, 2025

Is your feature request related to a problem or challenge?

Currently stats filter pruning (both at the row group and page level) has one of two outcomes per container:

  1. This container cannot possibly match the filter (discard it).
  2. This container may match the filter, but which rows to include or exclude must be confirmed by evaluating the filter against each row of the data.

There is a big optimization opportunity here: if we know from statistics that every row in the container matches the filter, we don't need to evaluate the filter at all.

Consider a column `name` with values `["Adrian", "Adrian", "Adrian"]`. The min/max stats are `"Adrian"`/`"Adrian"`. A query with the filter `name = 'Adrian'` should never need to read the column data to know that all rows match the filter.

Another relevant case is a `ts` column with values `["2025-01-01T00:00:00Z", ..., "2025-01-01T00:01:32Z"]`. The values need not be sorted or ordered, but let's say that the min/max stats are `"2025-01-01T00:00:00Z"`/`"2025-01-01T00:01:32Z"`. For a filter `ts > '2024-12-31T00:00:00Z'` there is no need to evaluate the filter on every row: the stats alone prove that every row matches.
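The decision described above can be sketched as a small three-way outcome derived purely from min/max statistics. This is a conceptual illustration, not DataFusion's actual pruning API; the enum and function names are hypothetical:

```rust
// Sketch: classify a container from its min/max statistics alone.
// (Hypothetical names; DataFusion's real pruning machinery differs.)

#[derive(Debug, PartialEq)]
enum PruneOutcome {
    /// Every row matches: the filter need not be evaluated at all.
    AllMatch,
    /// No row can match: discard the container.
    NoneMatch,
    /// Some rows may match: evaluate the filter row-by-row.
    MayMatch,
}

/// Outcome of `col > value` given the container's min/max for `col`.
/// (Lexicographic comparison works for RFC 3339 timestamps.)
fn prune_gt(min: &str, max: &str, value: &str) -> PruneOutcome {
    if min > value {
        PruneOutcome::AllMatch // even the smallest stored value passes
    } else if max <= value {
        PruneOutcome::NoneMatch // even the largest stored value fails
    } else {
        PruneOutcome::MayMatch
    }
}

/// Outcome of `col = value` given the container's min/max for `col`.
fn prune_eq(min: &str, max: &str, value: &str) -> PruneOutcome {
    if min == value && max == value {
        PruneOutcome::AllMatch // constant column equal to the literal
    } else if value < min || value > max {
        PruneOutcome::NoneMatch
    } else {
        PruneOutcome::MayMatch
    }
}

fn main() {
    // name = 'Adrian' with min = max = "Adrian": all rows match.
    assert_eq!(prune_eq("Adrian", "Adrian", "Adrian"), PruneOutcome::AllMatch);
    // ts > '2024-12-31T00:00:00Z' with min "2025-01-01T00:00:00Z": all rows match.
    assert_eq!(
        prune_gt("2025-01-01T00:00:00Z", "2025-01-01T00:01:32Z", "2024-12-31T00:00:00Z"),
        PruneOutcome::AllMatch
    );
}
```

Today's pruning effectively collapses `AllMatch` into `MayMatch`; the proposal is to surface the `AllMatch` case so consumers can skip filter evaluation entirely.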

We could incorporate this change, but it would require some refactoring of https://github.com/apache/datafusion/blob/main/datafusion/physical-optimizer/src/pruning.rs and consumers.

@adriangb adriangb added the enhancement New feature or request label Mar 25, 2025
@matthewmturner
Contributor

I think this is similar to something I recently asked on Discord - except I had in mind using only the metadata stats for queries like `SELECT MAX(timestamp) FROM quotes`.

This was my full comment / question

"I'm doing some data exploration on a table in DataFusion where I'm running the following: `SELECT MAX(timestamp) FROM quotes`. The quotes table is about 100GB of data. When I run `EXPLAIN ANALYZE` on this plan I see from the ParquetExec 6B+ output rows and 30GB+ of bytes scanned. Given that I'm only getting the MAX for the column, shouldn't I be able to get this by doing much less work, looking only at the row group metadata stats and not scanning any data? That would give me a huge performance improvement (the metadata load time is < 1% of the total time scanning)."

@adriangb
Contributor Author

I think that has more to do with the `datafusion.execution.collect_statistics` setting and use of those statistics.

@adriangb
Contributor Author

adriangb commented Mar 25, 2025

Well, if you did set `datafusion.execution.collect_statistics`, I think those stats would be used to calculate the max before ever reaching the scanning phase, via some rewrites. But I think you're right: even if you didn't collect statistics upfront, if that expression could be pushed down into each individual file scan, then it could be optimized. Maybe related to #14993?
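The idea of answering `MAX(col)` from per-row-group statistics alone can be sketched as follows. This is a hypothetical illustration (the function name and `Option<i64>` stand-in for Parquet footer statistics are assumptions, not DataFusion's actual code); it also shows why the optimization is only sound when every row group carries an exact max statistic:

```rust
// Sketch: compute MAX(col) from row-group metadata only, without
// scanning any data pages. `row_group_maxes` stands in for the max
// statistic of each row group, read from the Parquet footer; `None`
// means that row group has no usable statistic.

fn max_from_stats(row_group_maxes: &[Option<i64>]) -> Option<i64> {
    // If any row group lacks a max statistic, we cannot answer from
    // metadata alone and must fall back to scanning the data.
    row_group_maxes
        .iter()
        .copied()
        .collect::<Option<Vec<i64>>>()? // None if any statistic is missing
        .into_iter()
        .max()
}

fn main() {
    // All row groups have stats: answer directly from metadata.
    assert_eq!(max_from_stats(&[Some(3), Some(7), Some(5)]), Some(7));
    // A row group without stats forces a fallback to scanning.
    assert_eq!(max_from_stats(&[Some(3), None]), None);
}
```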
