[Epic] High cardinality aggregation performance wishlist #11679

Open
1 of 4 tasks
alamb opened this issue Jul 26, 2024 · 4 comments
Labels
enhancement New feature or request

Comments

alamb commented Jul 26, 2024

Is your feature request related to a problem or challenge?

DataFusion uses a two-phase approach to aggregation (see Accumulator::state for details):

                              ▲
                              │                   evaluate() is called to
                              │                   produce the final aggregate
                              │                   value per group
                              │
                 ┌─────────────────────────┐
                 │GroupBy                  │
                 │(AggregateMode::Final)   │      state() is called for each
                 │                         │      group and the resulting
                 └─────────────────────────┘      RecordBatches passed to the
                              ▲
                              │
             ┌────────────────┴───────────────┐
             │                                │
             │                                │
┌─────────────────────────┐      ┌─────────────────────────┐
│         GroupBy         │      │         GroupBy         │
│(AggregateMode::Partial) │      │(AggregateMode::Partial) │
└─────────────────────────┘      └────────────▲────────────┘
             ▲                                │
             │                                │    update_batch() is called for
             │                                │    each input RecordBatch
        .─────────.                      .─────────.
     ,─'           '─.                ,─'           '─.
    ;      Input      :              ;      Input      :
    :   Partition 0   ;              :   Partition 1   ;
     ╲               ╱                ╲               ╱
      '─.         ,─'                  '─.         ,─'
         `───────'                        `───────'
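
To make the flow in the diagram concrete, here is a minimal, self-contained sketch of a two-phase AVG in plain Rust. It does not use the DataFusion crate: `AvgState`, `partial_group_by`, and `final_group_by` are hypothetical names invented for this illustration, and only the method names (`update_batch`, `state`, `merge_batch`, `evaluate`) are borrowed from the diagram. The real `Accumulator` trait works on Arrow arrays and has different signatures.

```rust
use std::collections::HashMap;

/// Intermediate state for AVG: a running sum and a row count.
/// Hypothetical type for this sketch, not a DataFusion struct.
#[derive(Debug, Clone, Copy, Default, PartialEq)]
pub struct AvgState {
    pub sum: f64,
    pub count: u64,
}

impl AvgState {
    /// Partial phase: fold a batch of raw input values into the state.
    pub fn update_batch(&mut self, values: &[f64]) {
        for v in values {
            self.sum += *v;
            self.count += 1;
        }
    }

    /// Partial phase output: the intermediate state sent downstream.
    pub fn state(&self) -> (f64, u64) {
        (self.sum, self.count)
    }

    /// Final phase: combine intermediate states from the partial phase.
    pub fn merge_batch(&mut self, states: &[(f64, u64)]) {
        for (s, c) in states {
            self.sum += *s;
            self.count += *c;
        }
    }

    /// Final phase output: the aggregate value for this group.
    pub fn evaluate(&self) -> Option<f64> {
        if self.count == 0 {
            None
        } else {
            Some(self.sum / self.count as f64)
        }
    }
}

/// Partial GroupBy over one partition: one AvgState per distinct key.
pub fn partial_group_by(rows: &[(u32, f64)]) -> HashMap<u32, AvgState> {
    let mut groups: HashMap<u32, AvgState> = HashMap::new();
    for (key, value) in rows {
        groups.entry(*key).or_default().update_batch(&[*value]);
    }
    groups
}

/// Final GroupBy: merge the per-partition states, then evaluate each group.
pub fn final_group_by(partials: &[HashMap<u32, AvgState>]) -> HashMap<u32, f64> {
    let mut merged: HashMap<u32, AvgState> = HashMap::new();
    for partial in partials {
        for (key, acc) in partial {
            merged.entry(*key).or_default().merge_batch(&[acc.state()]);
        }
    }
    merged
        .into_iter()
        .filter_map(|(k, acc)| acc.evaluate().map(|v| (k, v)))
        .collect()
}
```

Each input partition would call `partial_group_by` on its own rows, and the resulting maps would then be handed to `final_group_by`. Note that the final phase has to touch every (partition, group) pair the partial phase emits, which is why the number of distinct groups matters so much below.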

For low cardinality aggregates (where there are a few distinct groups), this works great 👌 👨‍🍳

However, for high cardinality aggregates (where there are many millions of groups), we can do better by optimizing this path. See the background and ASCII art on #7957 for why the intermediate cardinality increases.
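
As a rough illustration of the cardinality problem (a sketch under stated assumptions, not a reproduction of the analysis in #7957), the snippet below counts how many rows a partial phase would emit, assuming it outputs one row per distinct key per partition. The function names and the synthetic data layout (contiguous chunks of rows, key = row number modulo the group count) are assumptions made purely for this example.

```rust
use std::collections::HashSet;

/// Rows emitted by the partial phase, assuming one output row per
/// distinct group key in each partition.
pub fn intermediate_rows(partitions: &[Vec<u64>]) -> usize {
    partitions
        .iter()
        .map(|keys| keys.iter().collect::<HashSet<_>>().len())
        .sum::<usize>()
}

/// Hypothetical synthetic input: split `num_rows` rows into contiguous,
/// equally sized partitions and assign key = row % num_groups.
/// Assumes `num_rows` is divisible by `num_partitions`.
pub fn synthetic_partitions(
    num_rows: u64,
    num_groups: u64,
    num_partitions: u64,
) -> Vec<Vec<u64>> {
    let chunk = num_rows / num_partitions;
    (0..num_partitions)
        .map(|p| {
            ((p * chunk)..((p + 1) * chunk))
                .map(|row| row % num_groups)
                .collect::<Vec<u64>>()
        })
        .collect()
}
```

With 1000 rows in 4 partitions, 5 groups give 20 intermediate rows (a 50x reduction before the final phase), while 1000 groups give 1000 intermediate rows: the partial phase pays for hashing every row but reduces nothing, and the final phase then repeats the work.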

This is my wishlist for improving high cardinality aggregates (ideally in time for the next blog post in a few months, #11631).

Together with the StringView work in #10918 that @XiangpengHao, @a10y, and others are working on, I think it would provide some very compelling overall speedups in ClickBench and TPC-H queries.

Also I hear that @avantgardnerio may be interested in helping here

Describe the solution you'd like

Here is my wishlist:

Describe alternatives you've considered

Do nothing and let DuckDB pass us by ;)

Additional context

Other potential things to do:

alamb commented Jul 30, 2024

alamb commented Aug 5, 2024

#6937 -- bam

alamb commented Sep 25, 2024

alamb commented Nov 5, 2024

Here is another great improvement: #12996
