[Epic] High cardinality aggregation performance wishlist #11679

alamb · 2024-07-26T19:48:11Z

Is your feature request related to a problem or challenge?

DataFusion uses a two phase approach to aggregation (see Accumulator::state for details):

                              ▲
                              │                   evaluate() is called to
                              │                   produce the final aggregate
                              │                   value per group
                              │
                 ┌─────────────────────────┐
                 │GroupBy                  │
                 │(AggregateMode::Final)   │      state() is called for each
                 │                         │      group and the resulting
                 └─────────────────────────┘      RecordBatches passed to the
                              ▲
                              │
             ┌────────────────┴───────────────┐
             │                                │
             │                                │
┌─────────────────────────┐      ┌─────────────────────────┐
│        GroupBy          │      │        GroupBy          │
│(AggregateMode::Partial) │      │(AggregateMode::Partial) │
└─────────────────────────┘      └────────────▲────────────┘
             ▲                                │
             │                                │    update_batch() is called for
             │                                │    each input RecordBatch
        .─────────.                      .─────────.
     ,─'           '─.                ,─'           '─.
    ;      Input      :              ;      Input      :
    :   Partition 0   ;              :   Partition 1   ;
     ╲               ╱                ╲               ╱
      '─.         ,─'                  '─.         ,─'
         `───────'                        `───────'
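To make the diagram concrete, here is a minimal sketch of the two phases for an AVG aggregate using only the standard library. `AvgAccumulator`, `partial_aggregate` and `final_aggregate` are hypothetical names for illustration; the method names mirror `update_batch()`, `state()` and `evaluate()` from the diagram (plus `merge_batch()` for the final phase), but this is not DataFusion's actual `Accumulator` trait, which operates on Arrow arrays.

```rust
use std::collections::HashMap;

/// Hypothetical per-group accumulator for AVG (not DataFusion's trait).
#[derive(Default, Clone, Copy)]
struct AvgAccumulator {
    sum: f64,
    count: u64,
}

impl AvgAccumulator {
    /// Partial phase: fold raw input values into the running state.
    fn update_batch(&mut self, values: &[f64]) {
        for v in values {
            self.sum += v;
            self.count += 1;
        }
    }

    /// Partial phase output: the intermediate state, not the final value.
    fn state(&self) -> (f64, u64) {
        (self.sum, self.count)
    }

    /// Final phase: combine intermediate states produced by the partial phase.
    fn merge_batch(&mut self, states: &[(f64, u64)]) {
        for (sum, count) in states {
            self.sum += sum;
            self.count += count;
        }
    }

    /// Final phase output: the aggregate value for this group.
    fn evaluate(&self) -> f64 {
        self.sum / self.count as f64
    }
}

/// Partial GroupBy over one input partition of (group key, value) rows.
fn partial_aggregate(rows: &[(&str, f64)]) -> HashMap<String, (f64, u64)> {
    let mut groups: HashMap<String, AvgAccumulator> = HashMap::new();
    for (key, value) in rows {
        groups.entry(key.to_string()).or_default().update_batch(&[*value]);
    }
    groups.into_iter().map(|(k, acc)| (k, acc.state())).collect()
}

/// Final GroupBy: merge the per-partition states, then evaluate each group.
fn final_aggregate(partials: Vec<HashMap<String, (f64, u64)>>) -> HashMap<String, f64> {
    let mut groups: HashMap<String, AvgAccumulator> = HashMap::new();
    for partial in partials {
        for (key, state) in partial {
            groups.entry(key).or_default().merge_batch(&[state]);
        }
    }
    groups.into_iter().map(|(k, acc)| (k, acc.evaluate())).collect()
}
```

Note that each partial phase emits one state row per distinct group it saw. When almost every input row is its own group, the partial phase shrinks the data very little, which is the high cardinality case this issue is about.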

For low cardinality aggregates (where there are a few distinct groups), this works great 👌 👨‍🍳

However, for high cardinality aggregates (where there are many millions of groups), we can do better by optimizing the path. See the background and ASCII art on #7957 for why the intermediate cardinality increases.

This is my wishlist for improving high cardinality aggregates (ideally for the next blog post in a few months, #11631).

Together with the StringView work in #10918 that @XiangpengHao @a10y and others are working on, I think it would provide some very compelling overall speedups in ClickBench and TPCH queries

Also I hear that @avantgardnerio may be interested in helping here

Describe the solution you'd like

Here is my wishlist:

Improve Memory usage + performance with large numbers of groups / High Cardinality Aggregates #6937 (@korowa has a PR up for this one)
Avoid extra copies in CoalesceBatchesExec to improve performance #7957 (I have a prototype and some ideas)
Improve performance of high cardinality grouping by reusing hash values #11680
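As a rough illustration of the "reusing hash values" item, the sketch below computes the hash of each group key once in the partial phase and carries it along with the intermediate state, so that repartitioning and the final phase can use the stored value instead of hashing the key again. All names here (`HashedGroup`, `upsert`, `partial_count`, `repartition`, `final_merge`) are hypothetical, and the use of `DefaultHasher` is an assumption for the example; this is not how DataFusion structures its operators.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hash a group key. In this sketch it is called only in the partial phase.
fn hash_key(key: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    hasher.finish()
}

/// Intermediate row: the group key, its saved hash, and a COUNT state.
struct HashedGroup {
    hash: u64,
    key: String,
    count: u64,
}

/// Buckets are looked up by the saved hash, never by rehashing the key.
type Buckets = HashMap<u64, Vec<HashedGroup>>;

/// Insert a group, or merge its count into an existing group with the same key.
fn upsert(buckets: &mut Buckets, g: HashedGroup) {
    let bucket = buckets.entry(g.hash).or_default();
    match bucket.iter_mut().find(|e| e.key == g.key) {
        Some(existing) => existing.count += g.count,
        None => bucket.push(g),
    }
}

/// Partial phase: the only place where the key itself is hashed.
fn partial_count(keys: &[&str]) -> Vec<HashedGroup> {
    let mut buckets = Buckets::new();
    for key in keys {
        let g = HashedGroup { hash: hash_key(key), key: key.to_string(), count: 1 };
        upsert(&mut buckets, g);
    }
    buckets.into_values().flatten().collect()
}

/// Repartition by the saved hash so equal keys land in the same partition.
fn repartition(groups: Vec<HashedGroup>, n: usize) -> Vec<Vec<HashedGroup>> {
    let mut out: Vec<Vec<HashedGroup>> = (0..n).map(|_| Vec::new()).collect();
    for g in groups {
        let p = (g.hash % n as u64) as usize;
        out[p].push(g);
    }
    out
}

/// Final phase: merge states for one partition, again using the saved hash.
fn final_merge(partition: Vec<HashedGroup>) -> Vec<HashedGroup> {
    let mut buckets = Buckets::new();
    for g in partition {
        upsert(&mut buckets, g);
    }
    buckets.into_values().flatten().collect()
}
```

The trade-off in a design like this is one extra `u64` per intermediate row in exchange for skipping the key hashing in the later stages.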

Describe alternatives you've considered

Do nothing and let DuckDB pass us by ;)

Additional context

Other potential things to do:

[EPIC] (Even More) Grouping / Group By / Aggregation Performance #7000


alamb · 2024-07-30T12:02:36Z

Bonus items:

Check saved hash first during probing bucket in aggr hash table #11717 from @Rachelint
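A minimal sketch of the idea behind this item, under the assumption that each hash table entry stores the hash alongside the group key: while probing a bucket, compare the saved `u64` hash first and only fall back to the more expensive key comparison when the hashes match. `Entry` and `probe` are hypothetical names, and the `key_comparisons` counter exists only to make the effect observable.

```rust
/// One occupied slot: the saved hash, the group key, and the group's index.
struct Entry {
    hash: u64,
    key: String,
    group_index: usize,
}

/// Look up `key` among the entries of one bucket.
/// Entries whose saved hash differs are rejected without touching the key.
fn probe(bucket: &[Entry], hash: u64, key: &str, key_comparisons: &mut usize) -> Option<usize> {
    for entry in bucket {
        // Cheap check first: a single integer comparison.
        if entry.hash != hash {
            continue;
        }
        // Only now pay for the (potentially long) key comparison.
        *key_comparisons += 1;
        if entry.key == key {
            return Some(entry.group_index);
        }
    }
    None
}
```

For long string keys or multi-column keys, most non-matching entries in a bucket would be rejected by the integer comparison alone.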

alamb · 2024-08-05T11:10:36Z

#6937 -- bam

alamb · 2024-09-25T11:50:22Z

Update here

Skipping partial aggregation when it is not helping for high cardinality aggregates #11627 -- improvement from @korowa
Avoid RowConverter for multi column grouping (10% faster clickbench queries) #12269 -- improvement from @jayzhan211
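The first item above (skipping partial aggregation) can be sketched as a simple probe: while the partial phase runs, track how many input rows have been seen and how many distinct groups they produced, and stop aggregating in the partial phase once the ratio shows it is not reducing the data. `SkipProbe`, its fields, and the threshold values used with it are assumptions for illustration, not the actual implementation from #11627.

```rust
/// Hypothetical probe that decides whether partial aggregation is still useful.
struct SkipProbe {
    /// Do not decide before this many input rows have been observed.
    rows_threshold: usize,
    /// Skip when (distinct groups / input rows) reaches this ratio.
    ratio_threshold: f64,
    input_rows: usize,
    num_groups: usize,
}

impl SkipProbe {
    fn new(rows_threshold: usize, ratio_threshold: f64) -> Self {
        SkipProbe { rows_threshold, ratio_threshold, input_rows: 0, num_groups: 0 }
    }

    /// Record one processed batch: its row count and how many new groups it created.
    fn record_batch(&mut self, rows: usize, new_groups: usize) {
        self.input_rows += rows;
        self.num_groups += new_groups;
    }

    /// True when enough rows were seen and nearly every row formed its own group,
    /// so rows could be forwarded to the final phase without partial aggregation.
    fn should_skip(&self) -> bool {
        if self.input_rows < self.rows_threshold || self.input_rows == 0 {
            return false;
        }
        let ratio = self.num_groups as f64 / self.input_rows as f64;
        ratio >= self.ratio_threshold
    }
}
```

With a ratio threshold of 0.8, for example, an input where 95 of the first 100 rows are distinct groups would trigger the skip, while an input with 10 groups in 1000 rows would keep aggregating.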

alamb · 2024-11-05T18:38:08Z

Here is another great improvement: #12996

alamb added the enhancement New feature or request label Jul 26, 2024


alamb mentioned this issue Nov 5, 2024

Support vectorized append and compare for multi group by #12996

Merged
