Add 'clickbench_extended' benchmark #8860
Comments
Hi @alamb, do you know why these queries are failing to use the code in my PRs? I think there are a number of cases they could be extended to cover without an extreme amount of effort.
If you mean "why isn't the code added in #7192 used in the ClickBench queries", I think it is because none of the ClickBench queries follow the pattern `SELECT min(time), group_id FROM ... GROUP BY group_id ORDER BY min(time) LIMIT ...`.
Ah, okay. I was hoping some queries were similar enough to be accelerated by that code, given a little effort, but it does not appear that way.
Indeed -- from my perspective the ClickBench queries can not be changed (we don't define them; they come from the ClickBench benchmark itself). That is why I propose adding an extended version here. What I would like to do is come up with some queries that do have the pattern that is accelerated and that use the same data. What about something like "Find the top 10 most recently active users":

```sql
WITH hits AS (
  SELECT "UserID",
         "EventDate"::INT::DATE::TIMESTAMP + arrow_cast("EventTime", 'Interval(DayTime)') AS time
  FROM 'hits.parquet'
)
SELECT min(time), "UserID"
FROM hits
GROUP BY "UserID"
ORDER BY min(time)
LIMIT 10;
```

Though this query appears not to use the optimization 🤔

```
❯ explain WITH hits AS (SELECT "UserID", "EventDate"::INT::DATE::TIMESTAMP + arrow_cast("EventTime", 'Interval(DayTime)') as time from 'hits.parquet') SELECT min(time), "UserID" from hits GROUP BY "UserID" ORDER BY min(time) limit 10;
+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type | plan |
+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan | Limit: skip=0, fetch=10 |
| | Sort: MIN(hits.time) ASC NULLS LAST, fetch=10 |
| | Projection: MIN(hits.time), hits.UserID |
| | Aggregate: groupBy=[[hits.UserID]], aggr=[[MIN(hits.time)]] |
| | SubqueryAlias: hits |
| | Projection: hits.parquet.UserID, CAST(CAST(CAST(hits.parquet.EventDate AS Int32) AS Date32) AS Timestamp(Nanosecond, None)) + CAST(hits.parquet.EventTime AS Interval(DayTime)) AS time |
| | TableScan: hits.parquet projection=[EventTime, EventDate, UserID] |
| physical_plan | GlobalLimitExec: skip=0, fetch=10 |
| | SortPreservingMergeExec: [MIN(hits.time)@0 ASC NULLS LAST], fetch=10 |
| | SortExec: TopK(fetch=10), expr=[MIN(hits.time)@0 ASC NULLS LAST] |
| | ProjectionExec: expr=[MIN(hits.time)@1 as MIN(hits.time), UserID@0 as UserID] |
| | AggregateExec: mode=FinalPartitioned, gby=[UserID@0 as UserID], aggr=[MIN(hits.time)] |
| | CoalesceBatchesExec: target_batch_size=8192 |
| | RepartitionExec: partitioning=Hash([UserID@0], 16), input_partitions=16 |
| | AggregateExec: mode=Partial, gby=[UserID@0 as UserID], aggr=[MIN(hits.time)] |
| | ProjectionExec: expr=[UserID@2 as UserID, CAST(CAST(CAST(EventDate@1 AS Int32) AS Date32) AS Timestamp(Nanosecond, None)) + CAST(EventTime@0 AS Interval(DayTime)) as time] |
| | ParquetExec: file_groups={16 groups: [[Users/andrewlamb/Downloads/hits.parquet:0..923748528], [Users/andrewlamb/Downloads/hits.parquet:923748528..1847497056], [Users/andrewlamb/Downloads/hits.parquet:1847497056..2771245584], [Users/andrewlamb/Downloads/hits.parquet:2771245584..3694994112], [Users/andrewlamb/Downloads/hits.parquet:3694994112..4618742640], ...]}, projection=[EventTime, EventDate, UserID] |
| | |
+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.048 seconds.
```
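One hedged way to investigate why the optimization is not applied would be to try a simpler variant that drops the CTE and the cast/projection, ordering directly by the aggregated raw column. This is only a debugging sketch: whether it actually triggers the TopK aggregation path is an open question, and `"EventTime"` here merely stands in for the derived `time` expression.

```sql
-- Hypothetical simplification for debugging, not a proposed benchmark query:
-- order directly by the aggregate of a raw column. If this plan does use the
-- accelerated path, the computed `time` projection in the CTE is the likely
-- reason the original query misses it.
SELECT min("EventTime"), "UserID"
FROM 'hits.parquet'
GROUP BY "UserID"
ORDER BY min("EventTime")
LIMIT 10;
```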
Is your feature request related to a problem or challenge?
The ClickBench benchmark has excellent coverage for aggregate / grouping queries. We have used it, run via `bench.sh`, for important work improving aggregates such as #6988 and #7064. However, there are some important optimizations, like #8849 and #7191 from @avantgardnerio, that the existing ClickBench queries do not cover. For example, @jayzhan211's change in #8849 (comment) makes certain realistic queries significantly faster, but no ClickBench query exercises that code path.
Describe the solution you'd like
I would like to add a new benchmark to `bench.sh` that uses the same dataset but has different queries than the existing ClickBench suite. The new queries should be realistic, and should exercise functionality (such as the optimizations above) that the existing queries do not.
Here is an example from #8849 (comment):
- Query: Distinct counts
- Query Explanation: Data exploration: understand the qualities of the data in `hits.parquet`
- Query Properties: multiple count distinct aggregates on string datatypes
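For concreteness, such a query might look something like the sketch below. The exact column choice is illustrative (not the final benchmark query); `"SearchPhrase"`, `"MobilePhoneModel"`, and `"URL"` are string columns in `hits.parquet`.

```sql
-- Sketch of a "distinct counts" query: several COUNT(DISTINCT ...) aggregates
-- over string columns, exercising the multiple-distinct-aggregate code path.
SELECT COUNT(DISTINCT "SearchPhrase"),
       COUNT(DISTINCT "MobilePhoneModel"),
       COUNT(DISTINCT "URL")
FROM 'hits.parquet';
```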
Describe alternatives you've considered
No response
Additional context
No response