Vectorize window functions #15607

Dandandan · 2025-04-06T17:44:51Z

Is your feature request related to a problem or challenge?

[More info to be added later]

DataFusion has quite good support for window functions, but (part of) the window function implementation is not yet (I guess mostly because we don't run popular benchmarks with window functions, tpch and clickbench don't have any).

We can try giving it a similar treatment as aggregates (e.g. GroupsAccumulator).

Describe the solution you'd like

Design a strategy and implement vectorized code for window functions (sum/count, etc) where possible.
Benchmark the improvement.

Describe alternatives you've considered

No response

Additional context

[Add a profiling image / info here]

suibianwanwank · 2025-04-08T14:11:38Z

I'll try to look into it, but it may take some time. If anyone else is working on this, I'd be happy to help in any way I can:)

alamb · 2025-04-08T21:02:36Z

I think we should start with creating a benchmark -- ideally a standard one that contains window functions

ding-young · 2025-04-09T01:26:49Z

@suibianwanwank @alamb Would it be okay for me to take a stab on this issue? I'll ask for help if I get stuck :)

suibianwanwank · 2025-04-09T15:07:24Z

@alamb I noticed we already have window_query_sql benchmark, which covers basic window functions. Maybe optimizing based on this benchmark would be good.

Dandandan · 2025-04-10T21:00:19Z

I highly recommend running the benchmark with the samply profiler, documented here:

https://github.com/apache/datafusion/blob/main/docs/source/library-user-guide/profiling.md#profiling-using-samply-cross-platform-profiler

Dandandan · 2025-04-10T21:05:44Z

Btw - you can run profile a benchmark after compiling with

samply record -r 1000 ./target/profiling/deps/window_query_sql-[....] --bench --noplot "window partition and order by, u64_wide"

And you'll get a UI which looks like this:

suibianwanwank · 2025-05-06T16:33:02Z

@Dandandan I've reviewed the current implementation of window functions in DataFusion and studied some related concepts. This DuckDB blog post, Flying Through Windows, describes their efforts in optimizing window functions. However, in practice, many of these optimizations are function-specific, and I don't yet have a good idea for implementing a general-purpose vectorized solution.

Additionally, I noticed that we haven't implemented the segment tree and vectorized optimizations mentioned in the DuckDB blog. This is because our current window frame implementation does not yet support flexible expr in frames. Before proceeding with such optimizations, should we first implement the feature described in ISSUE-15714?

Btw, I attempted to support streaming of aggregation results for different ranges within the same partition in BoundedWindowAggStream. But current benchmarks haven’t shown significant performance gains, with most of the time spent in calculate_range. I plan to add more complex benchmark case to assess performance improvements.

alamb · 2025-05-06T20:26:37Z

@Dandandan I've reviewed the current implementation of window functions in DataFusion and studied some related concepts. This DuckDB blog post, Flying Through Windows, describes their efforts in optimizing window functions. However, in practice, many of these optimizations are function-specific, and I don't yet have a good idea for implementing a general-purpose vectorized solution.

Yeah I am not sure that blog post has a lot to offer DataFusion

Additionally, I noticed that we haven't implemented the segment tree and vectorized optimizations mentioned in the DuckDB blog. This is because our current window frame implementation does not yet support flexible expr in frames. Before proceeding with such optimizations, should we first implement the feature described in ISSUE-15714?

It seems like vectorized implementations would be a great first step but they may be independent of calculating the window ranges.

In terms of Segment Trees, I know the TUM folks wrote a paper about how great it was but to my knowledge the technique is only implemented in Umbra and DuckDB. The content of Flying Through Windows certainly implies to me that there isn't really a "well known" pattern / easy to implement segment tree approach

I think the technique of sorting the windows by ORDER BY/ PARTITION BY (which Datafusion implements) is far more common and has many nice properties (like the sorting can take advantage of the work to improve larger than memory sorting)

Btw, I attempted to support streaming of aggregation results for different ranges within the same partition in BoundedWindowAggStream. But current benchmarks haven’t shown significant performance gains, with most of the time spent in calculate_range. I plan to add more complex benchmark case to assess performance improvements.

Thank you!

FYI @akurmustafa / @mustafasrepo in case you have other thoughts on this matter

akurmustafa · 2025-05-07T00:05:17Z

Thanks for pushing this feature. Currently, we don't have support for flexible expressions in window frames. Naive implementation of this feature would be O(n^2). Segment tree would decrease this complexity to the O(n*log(n)).

According to me, vectorization and support for flexible expressions in window frames (naive implementation) are kind of orthogonal. I think, we can either implement support for flexible window frames with naive implementation or vectorization first. I also think that Segment tree is kind of non-standard and has benefits only for some specific use cases. Segment tree should be done after these 2 stages (if it will be done at some point, also these 2 stages are kind of preliminary to trigger the use case for Segment tree).

Before proceeding with such optimizations, should we first implement the feature described in #15714?

@suibianwanwank, we can implement the feature in issue15714 first definitely. However, we shold start with the naive implementation (where window frame boundaries are calculated each time using binary search) without segment tree.

2010YOUY01 · 2025-05-07T09:10:23Z

@Dandandan I've reviewed the current implementation of window functions in DataFusion and studied some related concepts. This DuckDB blog post, Flying Through Windows, describes their efforts in optimizing window functions. However, in practice, many of these optimizations are function-specific, and I don't yet have a good idea for implementing a general-purpose vectorized solution.

Additionally, I noticed that we haven't implemented the segment tree and vectorized optimizations mentioned in the DuckDB blog. This is because our current window frame implementation does not yet support flexible expr in frames. Before proceeding with such optimizations, should we first implement the feature described in ISSUE-15714?

Btw, I attempted to support streaming of aggregation results for different ranges within the same partition in BoundedWindowAggStream. But current benchmarks haven’t shown significant performance gains, with most of the time spent in calculate_range. I plan to add more complex benchmark case to assess performance improvements.

It's a good idea to start with more benchmarks and expr in window range feature.

AFAIK, there is no well-known end-to-end benchmark focused on different types of window functions (for example ClickBench focuses on aggregations, but we lack a similar benchmark targeting window functions). Designing one specifically for window functions could provide significant value.

DuckDB maintains many micro-benchmarks for window functions, which might serve as good inspiration.

We could also include known anti-patterns that the segment tree approach aims to address. Segment tree optimization is a "worst-case optimal" technique, but (to me) it's unclear what kind of workload would actually trigger such worst cases. In most common scenarios, window functions behave more like sliding windows with small border adjustments, which are unlikely to cause major slowdowns. Therefore, it might be worthwhile to first identify such anti-patterns to justify the effort required to implement segment tree optimizations.

Dandandan · 2025-05-07T09:26:49Z

My idea for the scope of this issue wasn't to support arbitrary expressions in the window frames / implementSegment Tree, but rather vectorize the existing implementation - i.e.

limit allocations (e.g. ScalarValue) in inner loops (i.e. the mentioned calculate_range)
try doing calculations on a batch level instead of per-value

I think for the constant value the non-segment tree algorithm probably is the best choice anyway (and we should keep it), looking at this table from the paper linked by @alamb (even when same complexity updating the segment tree will probably be slower than the existing algorithm):

suibianwanwank · 2025-05-07T11:03:32Z

Sorry for mixing up the two issues and causing some confusion. I agree that vectorizing the current implementation and supporting expr in frame are orthogonal.

Dandandan added the performance Make DataFusion faster label Apr 6, 2025

Dandandan changed the title ~~Vectorized window functions~~ Vectorize window functions Apr 6, 2025

Dandandan mentioned this issue May 1, 2025

[DISCUSSION] DataFusion Road Map: Q3-Q4 2025 #15878

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Vectorize window functions #15607

Vectorize window functions #15607

Dandandan commented Apr 6, 2025 •

edited

Loading

suibianwanwank commented Apr 8, 2025

alamb commented Apr 8, 2025

ding-young commented Apr 9, 2025

suibianwanwank commented Apr 9, 2025

Dandandan commented Apr 10, 2025

Dandandan commented Apr 10, 2025 •

edited

Loading

suibianwanwank commented May 6, 2025

alamb commented May 6, 2025

akurmustafa commented May 7, 2025 •

edited

Loading

2010YOUY01 commented May 7, 2025

Dandandan commented May 7, 2025 •

edited

Loading

suibianwanwank commented May 7, 2025

Vectorize window functions #15607

Vectorize window functions #15607

Comments

Dandandan commented Apr 6, 2025 • edited Loading

Is your feature request related to a problem or challenge?

Describe the solution you'd like

Describe alternatives you've considered

Additional context

suibianwanwank commented Apr 8, 2025

alamb commented Apr 8, 2025

ding-young commented Apr 9, 2025

suibianwanwank commented Apr 9, 2025

Dandandan commented Apr 10, 2025

Dandandan commented Apr 10, 2025 • edited Loading

suibianwanwank commented May 6, 2025

alamb commented May 6, 2025

akurmustafa commented May 7, 2025 • edited Loading

2010YOUY01 commented May 7, 2025

Dandandan commented May 7, 2025 • edited Loading

suibianwanwank commented May 7, 2025

Dandandan commented Apr 6, 2025 •

edited

Loading

Dandandan commented Apr 10, 2025 •

edited

Loading

akurmustafa commented May 7, 2025 •

edited

Loading

Dandandan commented May 7, 2025 •

edited

Loading