-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Vectorize window functions #15607
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
Comments
I'll try to look into it, but it may take some time. If anyone else is working on this, I'd be happy to help in any way I can:) |
I think we should start with creating a benchmark -- ideally a standard one that contains window functions |
@suibianwanwank @alamb Would it be okay for me to take a stab on this issue? I'll ask for help if I get stuck :) |
@alamb I noticed we already have |
I highly recommend running the benchmark with the |
@Dandandan I've reviewed the current implementation of window functions in DataFusion and studied some related concepts. This DuckDB blog post, Flying Through Windows, describes their efforts in optimizing window functions. However, in practice, many of these optimizations are function-specific, and I don't yet have a good idea for implementing a general-purpose vectorized solution. Additionally, I noticed that we haven't implemented the segment tree and vectorized optimizations mentioned in the DuckDB blog. This is because our current window frame implementation does not yet support flexible expr in frames. Before proceeding with such optimizations, should we first implement the feature described in ISSUE-15714? Btw, I attempted to support streaming of aggregation results for different ranges within the same partition in BoundedWindowAggStream. But current benchmarks haven’t shown significant performance gains, with most of the time spent in calculate_range. I plan to add more complex benchmark case to assess performance improvements. |
Yeah I am not sure that blog post has a lot to offer DataFusion
It seems like vectorized implementations would be a great first step but they may be independent of calculating the window ranges. In terms of Segment Trees, I know the TUM folks wrote a paper about how great it was but to my knowledge the technique is only implemented in Umbra and DuckDB. The content of Flying Through Windows certainly implies to me that there isn't really a "well known" pattern / easy to implement segment tree approach I think the technique of sorting the windows by ORDER BY/ PARTITION BY (which Datafusion implements) is far more common and has many nice properties (like the sorting can take advantage of the work to improve larger than memory sorting)
Thank you! FYI @akurmustafa / @mustafasrepo in case you have other thoughts on this matter |
Thanks for pushing this feature. Currently, we don't have support for flexible expressions in window frames. Naive implementation of this feature would be According to me, vectorization and support for flexible expressions in window frames (naive implementation) are kind of orthogonal. I think, we can either implement support for flexible window frames with naive implementation or vectorization first. I also think that Segment tree is kind of non-standard and has benefits only for some specific use cases. Segment tree should be done after these 2 stages (if it will be done at some point, also these 2 stages are kind of preliminary to trigger the use case for Segment tree).
@suibianwanwank, we can implement the feature in issue15714 first definitely. However, we shold start with the naive implementation (where window frame boundaries are calculated each time using binary search) without segment tree. |
It's a good idea to start with more benchmarks and expr in window range feature. AFAIK, there is no well-known end-to-end benchmark focused on different types of window functions (for example ClickBench focuses on aggregations, but we lack a similar benchmark targeting window functions). Designing one specifically for window functions could provide significant value. DuckDB maintains many micro-benchmarks for window functions, which might serve as good inspiration. We could also include known anti-patterns that the segment tree approach aims to address. Segment tree optimization is a "worst-case optimal" technique, but (to me) it's unclear what kind of workload would actually trigger such worst cases. In most common scenarios, window functions behave more like sliding windows with small border adjustments, which are unlikely to cause major slowdowns. Therefore, it might be worthwhile to first identify such anti-patterns to justify the effort required to implement segment tree optimizations. |
My idea for the scope of this issue wasn't to support arbitrary expressions in the window frames / implement
I think for the constant value the non-segment tree algorithm probably is the best choice anyway (and we should keep it), looking at this table from the paper linked by @alamb (even when same complexity updating the segment tree will probably be slower than the existing algorithm): ![]() |
Sorry for mixing up the two issues and causing some confusion. I agree that vectorizing the current implementation and supporting expr in frame are orthogonal. |
Is your feature request related to a problem or challenge?
[More info to be added later]
DataFusion has quite good support for window functions, but (part of) the window function implementation is not yet (I guess mostly because we don't run popular benchmarks with window functions, tpch and clickbench don't have any).
We can try giving it a similar treatment as aggregates (e.g. GroupsAccumulator).
Describe the solution you'd like
Design a strategy and implement vectorized code for window functions (sum/count, etc) where possible.
Benchmark the improvement.
Describe alternatives you've considered
No response
Additional context
[Add a profiling image / info here]
The text was updated successfully, but these errors were encountered: