Create a Priority Queue based Aggregation with limit
#7192
Conversation
I plan to give this a look later today -- thank you @avantgardnerio
The more I think about this code / approach, the more I like it ❤️ -- I spent some time writing up how I think this basic strategy can be applied to all the various TopK type queries at #7198 (comment). I think my writeup assumes a slightly different packaging / way of deciding how to invoke this operator, but the basic idea I think is the same. Thank you for sparking this @avantgardnerio
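To make the strategy concrete, here is a minimal sketch of the priority-queue idea for a query shaped like `GROUP BY k ORDER BY MAX(v) DESC LIMIT n`. This is illustrative only, not the PR's actual code; `TopKGroups` and its methods are hypothetical names:

```rust
use std::cmp::Reverse;
use std::collections::{BinaryHeap, HashMap};

/// Keeps only the `k` groups with the largest max seen so far.
struct TopKGroups {
    k: usize,
    /// group -> current max
    maxes: HashMap<String, i64>,
    /// min-heap over (max, group): the worst retained group is on top
    heap: BinaryHeap<Reverse<(i64, String)>>,
}

impl TopKGroups {
    fn new(k: usize) -> Self {
        Self { k, maxes: HashMap::new(), heap: BinaryHeap::new() }
    }

    fn insert(&mut self, group: &str, val: i64) {
        if let Some(cur) = self.maxes.get_mut(group) {
            if val <= *cur {
                return; // not a new max for this group
            }
            *cur = val;
            // The old heap entry becomes stale; it is skipped lazily below.
            self.heap.push(Reverse((val, group.to_string())));
        } else {
            self.maxes.insert(group.to_string(), val);
            self.heap.push(Reverse((val, group.to_string())));
        }
        // Evict the group with the smallest max once more than k are kept.
        while self.maxes.len() > self.k {
            if let Some(Reverse((v, g))) = self.heap.pop() {
                // Only evict if this entry still reflects the group's max.
                if self.maxes.get(&g) == Some(&v) {
                    self.maxes.remove(&g);
                }
            }
        }
    }

    fn into_sorted_desc(self) -> Vec<(String, i64)> {
        let mut out: Vec<_> = self.maxes.into_iter().collect();
        out.sort_by(|a, b| b.1.cmp(&a.1));
        out
    }
}
```

The appeal is that a stream of (group, value) pairs folds in one pass while at most `k` groups stay resident, which is the memory-efficiency argument made later in this thread.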
Would anyone be able to provide advice on debugging sqllogictests? This error doesn't seem very informative... I'd expect to see more of a diff than this:
The docs are here; using `cargo test --test sqllogictests -- --complete` would likely save you time. I believe that diff says a new line was added to the explain plan (which makes sense if you have added a new optimizer pass).
TLDR: with the naive, unoptimized version in place, it looks to be 2X slower according to a test with realistic data. This is based on the fact that currently the normal aggregation runs twice, whereas with the rule enabled one of each runs.

I'm not super worried because:
- No matter what, this rule is much more memory efficient.

I'll pass the limit down the tree and we'll see if I'm right and we match speed.
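A toy sketch of what "passing the limit down the tree" could mean; this invents a miniature plan type rather than using DataFusion's actual optimizer API, and it elides the check (required in reality) that the sort key matches the aggregate expression:

```rust
/// Toy plan representation; names are invented for illustration.
enum Plan {
    Limit { n: usize, input: Box<Plan> },
    Aggregate { limit: Option<usize>, input: Box<Plan> },
    Scan,
}

/// Push a `Limit` into the aggregation beneath it, so the aggregation
/// can retain at most `n` groups instead of materializing all of them.
fn push_down_limit(plan: Plan) -> Plan {
    match plan {
        Plan::Limit { n, input } => match *input {
            Plan::Aggregate { input, .. } => Plan::Limit {
                n,
                input: Box::new(Plan::Aggregate { limit: Some(n), input }),
            },
            other => Plan::Limit { n, input: Box::new(push_down_limit(other)) },
        },
        Plan::Aggregate { limit, input } => Plan::Aggregate {
            limit,
            input: Box::new(push_down_limit(*input)),
        },
        Plan::Scan => Plan::Scan,
    }
}
```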
We can see it doing the right thing now, but very slowly (debug mode is 10x, divide by 10 for release).

Edit: it's almost like there is some high, fixed cost to running this stream 🤔 Welp, at least testing is in place. I'll start tracking down performance issues tomorrow.
@avantgardnerio seems best to profile it ATM and see where the most time is spent
The current PR is looking good to me. I think it is in good shape to be merged and to be continued/extended. I've one small remaining comment about the `rand` dependency.
Let me know if there is anything I can do for this PR -- I think merging the PR and continuing to iterate would be a fine idea, given how long this one has been outstanding and how large it has grown
Thanks, I was waiting for a non-coralogix ✅ since I introduced a bunch of …

I am backed up on reviews as I was off last week. I will try and find time to review this tomorrow
Thank you @avantgardnerio -- I didn't review the implementation in detail, but I skimmed it and it looked solid to me (and I trust that @Dandandan's and @thinkharderdev's attention is sufficient).
I think this PR is almost ready to merge, the only things I think it needs are:
- An end to end test for actually limiting the values: https://github.com/apache/arrow-datafusion/pull/7192/files#r1301686217
- The follow on work suggested by @ozankabak in https://github.com/apache/arrow-datafusion/pull/7192/files#r1308198186
Also, if someone wanted to change this code in the future, are there benchmarks that would catch any performance regressions?
query TI
select trace_id, MAX(timestamp) from traces group by trace_id order by MAX(timestamp) desc limit 4;
I do think it is important to have an end-to-end test that actually limits the number of values coming out - as I mentioned here, I think this test only has 4 distinct groups, and thus a `limit 4` doesn't actually do any limiting.
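For illustration, reusing the hypothetical `TopKGroups` from the sketch earlier in this thread, the reviewer's point looks like this: a limit equal to the number of distinct groups never evicts anything, so it exercises no limiting logic at all:

```rust
fn main() {
    // Four distinct groups, as in the test above.
    let rows = [("a", 1_i64), ("b", 2), ("c", 3), ("d", 4)];

    let mut top4 = TopKGroups::new(4);
    let mut top3 = TopKGroups::new(3);
    for (g, v) in rows {
        top4.insert(g, v);
        top3.insert(g, v);
    }

    // limit 4 over 4 groups: nothing is cut, the eviction path never runs.
    assert_eq!(top4.into_sorted_desc().len(), 4);
    // limit 3 actually evicts the smallest group ("a").
    assert_eq!(top3.into_sorted_desc().len(), 3);
}
```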
There is a benchmark. I'm not sure... I think the GitHub action fails if that regresses?

I added some `limit 3` tests.
Which issue does this PR close?
Closes #7191.
Rationale for this change
Described in issue.
What changes are included in this PR?
- a `GroupedTopKAggregateStream` aggregation
- a `limit` property on `AggregateExec`
- removal of the `SortExec` if applicable

Are these changes tested?
- `AggregateExec` now prints `lim=X` if there's a limit, and I added some tests to assert this
- `sqllogictest`s to compare to existing functionality

Are there any user-facing changes?
~~I probably broke other things so this is a draft~~ All the existing tests now pass

Notes
Concerns to address:
- ~~the `OwnedRow` code is not columnar, vectorized, etc~~ most queries will use a single column
- ~~use the existing Accumulators?~~ not required since this is only min/max
- ~~filters are not yet applied~~ unsupported edge case for now
- should this be a new `Exec` node, not just a new `Stream` type?
- ~~key types other than `String`~~ now supports String + all primitive keys
- ~~`TreeMap`~~ with custom index-based heap (see the sketch after this list)

Out of scope
- `OwnedRow`
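Since the notes mention swapping the `TreeMap` for a custom index-based heap, here is a rough sketch of that kind of structure; it is a guess at the idea, not the PR's implementation. A binary min-heap stores dense group ids, and a position map lets a group's entry be found and re-sifted in O(log n) when its aggregate grows, where a `TreeMap` would need a remove plus an insert:

```rust
/// Min-heap over dense group ids, ordered by each group's current value.
struct IndexHeap {
    vals: Vec<i64>,   // vals[group] = current aggregate for that group id
    heap: Vec<usize>, // heap of group ids
    pos: Vec<usize>,  // pos[group] = index of that group inside `heap`
}

impl IndexHeap {
    fn new(n_groups: usize) -> Self {
        Self {
            vals: vec![i64::MIN; n_groups],
            heap: Vec::new(),
            pos: vec![usize::MAX; n_groups],
        }
    }

    /// The group with the smallest value, i.e. the eviction candidate.
    fn min_group(&self) -> Option<usize> {
        self.heap.first().copied()
    }

    fn push(&mut self, group: usize, val: i64) {
        self.vals[group] = val;
        self.pos[group] = self.heap.len();
        self.heap.push(group);
        self.sift_up(self.heap.len() - 1);
    }

    /// Raise a group's value and restore heap order in O(log n).
    fn increase(&mut self, group: usize, val: i64) {
        self.vals[group] = val;
        self.sift_down(self.pos[group]);
    }

    fn swap(&mut self, a: usize, b: usize) {
        self.heap.swap(a, b);
        self.pos[self.heap[a]] = a;
        self.pos[self.heap[b]] = b;
    }

    fn sift_up(&mut self, mut i: usize) {
        while i > 0 {
            let parent = (i - 1) / 2;
            if self.vals[self.heap[i]] < self.vals[self.heap[parent]] {
                self.swap(i, parent);
                i = parent;
            } else {
                break;
            }
        }
    }

    fn sift_down(&mut self, mut i: usize) {
        loop {
            let (l, r) = (2 * i + 1, 2 * i + 2);
            let mut smallest = i;
            if l < self.heap.len() && self.vals[self.heap[l]] < self.vals[self.heap[smallest]] {
                smallest = l;
            }
            if r < self.heap.len() && self.vals[self.heap[r]] < self.vals[self.heap[smallest]] {
                smallest = r;
            }
            if smallest == i {
                break;
            }
            self.swap(i, smallest);
            i = smallest;
        }
    }
}
```

Because the ids are dense `usize` indexes, `vals` and `pos` can be plain `Vec`s, avoiding the allocation and pointer-chasing of a node-based tree.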