Skip to content

Vectorize window functions #15607

New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Open
Dandandan opened this issue Apr 6, 2025 · 12 comments
Open

Vectorize window functions #15607

Dandandan opened this issue Apr 6, 2025 · 12 comments
Labels
performance Make DataFusion faster

Comments

@Dandandan
Copy link
Contributor

Dandandan commented Apr 6, 2025

Is your feature request related to a problem or challenge?

[More info to be added later]

DataFusion has quite good support for window functions, but (part of) the window function implementation is not yet (I guess mostly because we don't run popular benchmarks with window functions, tpch and clickbench don't have any).

We can try giving it a similar treatment as aggregates (e.g. GroupsAccumulator).

Describe the solution you'd like

Design a strategy and implement vectorized code for window functions (sum/count, etc) where possible.
Benchmark the improvement.

Describe alternatives you've considered

No response

Additional context

[Add a profiling image / info here]

@Dandandan Dandandan added the performance Make DataFusion faster label Apr 6, 2025
@Dandandan Dandandan changed the title Vectorized window functions Vectorize window functions Apr 6, 2025
@suibianwanwank
Copy link
Contributor

I'll try to look into it, but it may take some time. If anyone else is working on this, I'd be happy to help in any way I can:)

@alamb
Copy link
Contributor

alamb commented Apr 8, 2025

I think we should start with creating a benchmark -- ideally a standard one that contains window functions

@ding-young
Copy link
Contributor

@suibianwanwank @alamb Would it be okay for me to take a stab on this issue? I'll ask for help if I get stuck :)

@suibianwanwank
Copy link
Contributor

@alamb I noticed we already have window_query_sql benchmark, which covers basic window functions. Maybe optimizing based on this benchmark would be good.

@Dandandan
Copy link
Contributor Author

I highly recommend running the benchmark with the samply profiler, documented here:

https://github.com/apache/datafusion/blob/main/docs/source/library-user-guide/profiling.md#profiling-using-samply-cross-platform-profiler

@Dandandan
Copy link
Contributor Author

Dandandan commented Apr 10, 2025

Btw - you can run profile a benchmark after compiling with

samply record -r 1000 ./target/profiling/deps/window_query_sql-[....] --bench --noplot "window partition and order by, u64_wide"

And you'll get a UI which looks like this:

Image

@suibianwanwank
Copy link
Contributor

@Dandandan I've reviewed the current implementation of window functions in DataFusion and studied some related concepts. This DuckDB blog post, Flying Through Windows, describes their efforts in optimizing window functions. However, in practice, many of these optimizations are function-specific, and I don't yet have a good idea for implementing a general-purpose vectorized solution.

Additionally, I noticed that we haven't implemented the segment tree and vectorized optimizations mentioned in the DuckDB blog. This is because our current window frame implementation does not yet support flexible expr in frames. Before proceeding with such optimizations, should we first implement the feature described in ISSUE-15714?

Btw, I attempted to support streaming of aggregation results for different ranges within the same partition in BoundedWindowAggStream. But current benchmarks haven’t shown significant performance gains, with most of the time spent in calculate_range. I plan to add more complex benchmark case to assess performance improvements.

@alamb
Copy link
Contributor

alamb commented May 6, 2025

@Dandandan I've reviewed the current implementation of window functions in DataFusion and studied some related concepts. This DuckDB blog post, Flying Through Windows, describes their efforts in optimizing window functions. However, in practice, many of these optimizations are function-specific, and I don't yet have a good idea for implementing a general-purpose vectorized solution.

Yeah I am not sure that blog post has a lot to offer DataFusion

Additionally, I noticed that we haven't implemented the segment tree and vectorized optimizations mentioned in the DuckDB blog. This is because our current window frame implementation does not yet support flexible expr in frames. Before proceeding with such optimizations, should we first implement the feature described in ISSUE-15714?

It seems like vectorized implementations would be a great first step but they may be independent of calculating the window ranges.

In terms of Segment Trees, I know the TUM folks wrote a paper about how great it was but to my knowledge the technique is only implemented in Umbra and DuckDB. The content of Flying Through Windows certainly implies to me that there isn't really a "well known" pattern / easy to implement segment tree approach

I think the technique of sorting the windows by ORDER BY/ PARTITION BY (which Datafusion implements) is far more common and has many nice properties (like the sorting can take advantage of the work to improve larger than memory sorting)

Btw, I attempted to support streaming of aggregation results for different ranges within the same partition in BoundedWindowAggStream. But current benchmarks haven’t shown significant performance gains, with most of the time spent in calculate_range. I plan to add more complex benchmark case to assess performance improvements.

Thank you!

FYI @akurmustafa / @mustafasrepo in case you have other thoughts on this matter

@akurmustafa
Copy link
Contributor

akurmustafa commented May 7, 2025

Thanks for pushing this feature. Currently, we don't have support for flexible expressions in window frames. Naive implementation of this feature would be O(n^2). Segment tree would decrease this complexity to the O(n*log(n)).

According to me, vectorization and support for flexible expressions in window frames (naive implementation) are kind of orthogonal. I think, we can either implement support for flexible window frames with naive implementation or vectorization first. I also think that Segment tree is kind of non-standard and has benefits only for some specific use cases. Segment tree should be done after these 2 stages (if it will be done at some point, also these 2 stages are kind of preliminary to trigger the use case for Segment tree).

Before proceeding with such optimizations, should we first implement the feature described in #15714?

@suibianwanwank, we can implement the feature in issue15714 first definitely. However, we shold start with the naive implementation (where window frame boundaries are calculated each time using binary search) without segment tree.

@2010YOUY01
Copy link
Contributor

@Dandandan I've reviewed the current implementation of window functions in DataFusion and studied some related concepts. This DuckDB blog post, Flying Through Windows, describes their efforts in optimizing window functions. However, in practice, many of these optimizations are function-specific, and I don't yet have a good idea for implementing a general-purpose vectorized solution.

Additionally, I noticed that we haven't implemented the segment tree and vectorized optimizations mentioned in the DuckDB blog. This is because our current window frame implementation does not yet support flexible expr in frames. Before proceeding with such optimizations, should we first implement the feature described in ISSUE-15714?

Btw, I attempted to support streaming of aggregation results for different ranges within the same partition in BoundedWindowAggStream. But current benchmarks haven’t shown significant performance gains, with most of the time spent in calculate_range. I plan to add more complex benchmark case to assess performance improvements.

It's a good idea to start with more benchmarks and expr in window range feature.

AFAIK, there is no well-known end-to-end benchmark focused on different types of window functions (for example ClickBench focuses on aggregations, but we lack a similar benchmark targeting window functions). Designing one specifically for window functions could provide significant value.

DuckDB maintains many micro-benchmarks for window functions, which might serve as good inspiration.

We could also include known anti-patterns that the segment tree approach aims to address. Segment tree optimization is a "worst-case optimal" technique, but (to me) it's unclear what kind of workload would actually trigger such worst cases. In most common scenarios, window functions behave more like sliding windows with small border adjustments, which are unlikely to cause major slowdowns. Therefore, it might be worthwhile to first identify such anti-patterns to justify the effort required to implement segment tree optimizations.

@Dandandan
Copy link
Contributor Author

Dandandan commented May 7, 2025

My idea for the scope of this issue wasn't to support arbitrary expressions in the window frames / implementSegment Tree, but rather vectorize the existing implementation - i.e.

  • limit allocations (e.g. ScalarValue) in inner loops (i.e. the mentioned calculate_range)
  • try doing calculations on a batch level instead of per-value

I think for the constant value the non-segment tree algorithm probably is the best choice anyway (and we should keep it), looking at this table from the paper linked by @alamb (even when same complexity updating the segment tree will probably be slower than the existing algorithm):

Image

@suibianwanwank
Copy link
Contributor

Sorry for mixing up the two issues and causing some confusion. I agree that vectorizing the current implementation and supporting expr in frame are orthogonal.

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
performance Make DataFusion faster
Projects
None yet
Development

No branches or pull requests

6 participants