Is your feature request related to a problem or challenge?
Currently, stats-based filter pruning (at both the row group and page level) has one of two outcomes per container:

1. The container cannot possibly match the filter (discard it).
2. The container may match the filter, but which rows to include or exclude must be confirmed by evaluating the filter on each row.
There is a big optimization available here: if we know that every row in the container matches the filter, we don't need to evaluate the filter at all.
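A minimal sketch of the idea (names are hypothetical, not the actual `pruning.rs` API) is a third pruning outcome alongside the existing two:

```rust
/// Hypothetical three-state pruning outcome. Today's pruning
/// effectively only distinguishes "discard" from "keep and
/// re-evaluate the filter per row".
#[allow(dead_code)]
#[derive(Debug, PartialEq)]
enum PruneOutcome {
    /// Stats prove no row can match: discard the container.
    NoneMatch,
    /// Stats are inconclusive: keep the container and evaluate
    /// the filter on every row.
    MayMatch,
    /// Stats prove every row matches: keep the container and
    /// skip filter evaluation entirely (the proposed optimization).
    AllMatch,
}

fn main() {
    // With AllMatch available, a scan could skip the per-row filter.
    let outcome = PruneOutcome::AllMatch;
    assert!(matches!(outcome, PruneOutcome::AllMatch));
    println!("{:?}", outcome);
}
```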
Consider a column `name` with values `["Adrian", "Adrian", "Adrian"]`. The min/max stats are `"Adrian"`/`"Adrian"`. A query with the filter `name = 'Adrian'` should never need to read the column to know that all rows match the filter.
Another relevant case is a `ts` column with values `["2025-01-01T00:00:00Z", ..., "2025-01-01T00:01:32Z"]`. The values need not be sorted or ordered, but let's say that the min/max stats are `"2025-01-01T00:00:00Z"`/`"2025-01-01T00:01:32Z"`. For a filter `ts > '2024-12-31T00:00:00Z'` there should be no need to evaluate the filter on every row: we know from stats alone that every row matches.
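Both examples reduce to comparing the predicate constant against the container's min/max. A self-contained sketch (hypothetical helpers, not the `pruning.rs` implementation; lexicographic `&str` comparison stands in for typed comparison, which works for same-format RFC 3339 timestamps):

```rust
#[derive(Debug, PartialEq)]
enum PruneOutcome { NoneMatch, MayMatch, AllMatch }

/// Classify a container for the predicate `col > threshold`
/// using only its min/max statistics.
fn prune_gt(min: &str, max: &str, threshold: &str) -> PruneOutcome {
    if max <= threshold {
        PruneOutcome::NoneMatch // even the largest value fails
    } else if min > threshold {
        PruneOutcome::AllMatch // even the smallest value passes
    } else {
        PruneOutcome::MayMatch // must check row by row
    }
}

/// Classify for `col = literal`: all rows match iff min == max == literal.
fn prune_eq(min: &str, max: &str, literal: &str) -> PruneOutcome {
    if literal < min || literal > max {
        PruneOutcome::NoneMatch
    } else if min == max && min == literal {
        PruneOutcome::AllMatch
    } else {
        PruneOutcome::MayMatch
    }
}

fn main() {
    // name = 'Adrian' with stats "Adrian"/"Adrian": every row matches.
    assert_eq!(prune_eq("Adrian", "Adrian", "Adrian"), PruneOutcome::AllMatch);
    // ts > '2024-12-31...' with min "2025-01-01T00:00:00Z": every row matches.
    assert_eq!(
        prune_gt(
            "2025-01-01T00:00:00Z",
            "2025-01-01T00:01:32Z",
            "2024-12-31T00:00:00Z"
        ),
        PruneOutcome::AllMatch
    );
    println!("ok");
}
```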
I think this is similar to something I recently asked on Discord, except I had in mind using only the metadata stats for queries like `SELECT MAX(timestamp) FROM quotes`.
This was my full comment / question:

"I'm doing some data exploration on a table in DataFusion where I'm running the following: `SELECT MAX(timestamp) FROM quotes`. The quotes table is about 100GB of data. When I run `EXPLAIN ANALYZE` on this plan I see from the ParquetExec 6B+ output rows and 30GB+ of bytes scanned. Given that I'm only getting the MAX for the column, shouldn't I be able to get this by doing much less work, looking only at the row group metadata stats and not scanning any data? That would give me a huge performance improvement (the metadata load time is < 1% of the total time spent scanning)."
Well, if you did set `datafusion.execution.collect_statistics`, I think those stats would be used to calculate the max before getting to the scanning phase via some rewrites. But I think you're right that even if you didn't collect statistics upfront, if that expression could be pushed down into each individual file scan then it could be optimized. Maybe related to #14993?
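The `MAX` case can in principle be answered from Parquet footer metadata alone, provided every row group carries an exact max statistic. A hedged sketch (hypothetical `RowGroupStats` type, not the actual parquet crate API):

```rust
/// Hypothetical per-row-group statistics, as would be read from
/// a Parquet footer. `None` models a missing or truncated stat.
struct RowGroupStats {
    max_ts: Option<i64>,
}

/// Answer `SELECT MAX(ts)` from metadata alone. Returns None if any
/// row group lacks an exact max, in which case a scan is required.
fn max_from_stats(groups: &[RowGroupStats]) -> Option<i64> {
    groups
        .iter()
        .map(|g| g.max_ts)
        .collect::<Option<Vec<_>>>()
        .and_then(|maxes| maxes.into_iter().max())
}

fn main() {
    let groups = vec![
        RowGroupStats { max_ts: Some(1_000) },
        RowGroupStats { max_ts: Some(5_000) },
        RowGroupStats { max_ts: Some(3_000) },
    ];
    // Max over all row groups, with no data pages read.
    assert_eq!(max_from_stats(&groups), Some(5_000));

    // Any missing stat forces a fallback to scanning.
    let incomplete = vec![RowGroupStats { max_ts: None }];
    assert_eq!(max_from_stats(&incomplete), None);
    println!("ok");
}
```

The `None` fallback matters because Parquet statistics can be absent or truncated, so a metadata-only answer is only safe when every row group's stat is present and exact.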
We could incorporate this change, but it would require some refactoring of https://github.com/apache/datafusion/blob/main/datafusion/physical-optimizer/src/pruning.rs and its consumers.