Optimize `COUNT(1)`: Change the sentinel value's type for COUNT(*) to Int64 #9944

gruuya · 2024-04-04T10:52:57Z

Which issue does this PR close?

Closes #9943.

Rationale for this change

Make COUNT(1) equivalent to COUNT(*).

What changes are included in this PR?

Change the type of COUNT_STAR_EXPANSION to ScalarValue::Int64.
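
To illustrate why the sentinel's type matters, here is a minimal, self-contained sketch. It assumes the fast path recognizes the sentinel by plain value equality, and it relies on a bare SQL `1` showing up as `Int64(1)` in the logical plan quoted later in this thread. `Scalar`, `OLD_SENTINEL`, `NEW_SENTINEL` and `is_count_star_sentinel` are hypothetical stand-ins for illustration, not DataFusion's actual types or functions.

```rust
// Hypothetical stand-in for a typed scalar literal; NOT DataFusion's ScalarValue.
#[derive(Debug, Clone, PartialEq)]
enum Scalar {
    UInt8(Option<u8>),
    Int64(Option<i64>),
}

// The sentinel before and after the change described in this PR.
const OLD_SENTINEL: Scalar = Scalar::UInt8(Some(1));
const NEW_SENTINEL: Scalar = Scalar::Int64(Some(1));

/// Returns true when the argument of COUNT(..) equals the given sentinel.
/// Derived equality compares the variant (the type) as well as the value.
fn is_count_star_sentinel(arg: &Scalar, sentinel: &Scalar) -> bool {
    arg == sentinel
}

fn main() {
    // Assumption: the argument of `count(1)` is planned as Int64(1).
    let count_one_arg = Scalar::Int64(Some(1));

    println!(
        "matches old UInt8 sentinel: {}",
        is_count_star_sentinel(&count_one_arg, &OLD_SENTINEL)
    );
    println!(
        "matches new Int64 sentinel: {}",
        is_count_star_sentinel(&count_one_arg, &NEW_SENTINEL)
    );
}
```

Under these assumptions, the same numeric value in a different variant does not compare equal, which is why `COUNT(1)` only lines up with `COUNT(*)` once the sentinel is an `Int64`.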

Are these changes tested?

Yes, with revised existing tests and a new one that tests the optimization for the count(1) case.

Are there any user-facing changes?

gruuya · 2024-04-04T10:55:01Z

Might as well try whether /benchmark works as intended now.

github-actions · 2024-04-04T11:20:56Z

Benchmark results

Benchmarks comparing 4bd7c13 (main) and dbc5020 (PR)

Comparing 4bd7c13 and dbc5020
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃  4bd7c13 ┃  dbc5020 ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 291.81ms │ 293.02ms │     no change │
│ QQuery 2     │  39.03ms │  40.57ms │     no change │
│ QQuery 3     │  58.57ms │  59.64ms │     no change │
│ QQuery 4     │ 103.21ms │  79.51ms │ +1.30x faster │
│ QQuery 5     │ 130.88ms │  99.08ms │ +1.32x faster │
│ QQuery 6     │  16.41ms │  16.31ms │     no change │
│ QQuery 7     │ 233.86ms │ 231.76ms │     no change │
│ QQuery 8     │  43.20ms │  44.62ms │     no change │
│ QQuery 9     │ 126.29ms │ 118.55ms │ +1.07x faster │
│ QQuery 10    │ 109.28ms │ 111.10ms │     no change │
│ QQuery 11    │  46.48ms │  45.68ms │     no change │
│ QQuery 12    │  59.06ms │  59.39ms │     no change │
│ QQuery 13    │ 107.43ms │ 106.03ms │     no change │
│ QQuery 14    │  18.85ms │  18.99ms │     no change │
│ QQuery 15    │  31.99ms │  31.77ms │     no change │
│ QQuery 16    │  46.67ms │  47.47ms │     no change │
│ QQuery 17    │ 139.18ms │ 139.44ms │     no change │
│ QQuery 18    │ 512.08ms │ 590.62ms │  1.15x slower │
│ QQuery 19    │  64.66ms │  63.61ms │     no change │
│ QQuery 20    │ 121.75ms │ 119.54ms │     no change │
│ QQuery 21    │ 332.45ms │ 340.13ms │     no change │
│ QQuery 22    │  39.49ms │  39.01ms │     no change │
└──────────────┴──────────┴──────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary      ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (4bd7c13)   │ 2672.63ms │
│ Total Time (dbc5020)   │ 2695.85ms │
│ Average Time (4bd7c13) │  121.48ms │
│ Average Time (dbc5020) │  122.54ms │
│ Queries Faster         │         3 │
│ Queries Slower         │         1 │
│ Queries with No Change │        18 │
└────────────────────────┴───────────┘
--------------------
Benchmark tpch_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃  4bd7c13 ┃  dbc5020 ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 1     │ 434.65ms │ 438.47ms │    no change │
│ QQuery 2     │  55.96ms │  56.92ms │    no change │
│ QQuery 3     │ 143.14ms │ 146.84ms │    no change │
│ QQuery 4     │  86.33ms │  95.97ms │ 1.11x slower │
│ QQuery 5     │ 197.43ms │ 202.88ms │    no change │
│ QQuery 6     │ 107.99ms │ 108.68ms │    no change │
│ QQuery 7     │ 274.78ms │ 293.95ms │ 1.07x slower │
│ QQuery 8     │ 190.78ms │ 195.93ms │    no change │
│ QQuery 9     │ 288.65ms │ 300.95ms │    no change │
│ QQuery 10    │ 231.57ms │ 234.50ms │    no change │
│ QQuery 11    │  62.39ms │  62.23ms │    no change │
│ QQuery 12    │ 124.30ms │ 125.27ms │    no change │
│ QQuery 13    │ 172.85ms │ 177.04ms │    no change │
│ QQuery 14    │ 128.32ms │ 126.50ms │    no change │
│ QQuery 15    │ 190.29ms │ 195.11ms │    no change │
│ QQuery 16    │  51.30ms │  50.32ms │    no change │
│ QQuery 17    │ 307.81ms │ 298.61ms │    no change │
│ QQuery 18    │ 442.69ms │ 443.12ms │    no change │
│ QQuery 19    │ 230.46ms │ 229.31ms │    no change │
│ QQuery 20    │ 189.53ms │ 191.92ms │    no change │
│ QQuery 21    │ 321.44ms │ 328.69ms │    no change │
│ QQuery 22    │  53.32ms │  51.77ms │    no change │
└──────────────┴──────────┴──────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary      ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (4bd7c13)   │ 4285.97ms │
│ Total Time (dbc5020)   │ 4354.99ms │
│ Average Time (4bd7c13) │  194.82ms │
│ Average Time (dbc5020) │  197.95ms │
│ Queries Faster         │         0 │
│ Queries Slower         │         2 │
│ Queries with No Change │        20 │
└────────────────────────┴───────────┘
--------------------
Benchmark tpch_sf10.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Query        ┃   4bd7c13 ┃   dbc5020 ┃    Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━┩
│ QQuery 1     │ 4249.17ms │ 4276.34ms │ no change │
│ QQuery 2     │  470.59ms │  485.87ms │ no change │
│ QQuery 3     │ 1718.23ms │ 1705.55ms │ no change │
│ QQuery 4     │  824.64ms │  828.50ms │ no change │
│ QQuery 5     │ 2155.19ms │ 2168.08ms │ no change │
│ QQuery 6     │ 1037.84ms │ 1037.32ms │ no change │
│ QQuery 7     │ 3553.98ms │ 3693.11ms │ no change │
│ QQuery 8     │ 2402.73ms │ 2441.64ms │ no change │
│ QQuery 9     │ 4135.17ms │ 4112.04ms │ no change │
│ QQuery 10    │ 2548.32ms │ 2545.29ms │ no change │
│ QQuery 11    │  570.79ms │  585.43ms │ no change │
│ QQuery 12    │ 1185.51ms │ 1187.38ms │ no change │
│ QQuery 13    │ 2316.59ms │ 2316.11ms │ no change │
│ QQuery 14    │ 1288.98ms │ 1288.92ms │ no change │
│ QQuery 15    │ 1963.60ms │ 1952.05ms │ no change │
│ QQuery 16    │  516.67ms │  516.08ms │ no change │
│ QQuery 17    │ 5240.30ms │ 5188.41ms │ no change │
│ QQuery 18    │ 6835.60ms │ 6946.31ms │ no change │
│ QQuery 19    │ 2234.32ms │ 2240.05ms │ no change │
│ QQuery 20    │ 2556.45ms │ 2588.11ms │ no change │
│ QQuery 21    │ 4333.60ms │ 4301.07ms │ no change │
│ QQuery 22    │  574.15ms │  553.15ms │ no change │
└──────────────┴───────────┴───────────┴───────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary      ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (4bd7c13)   │ 52712.46ms │
│ Total Time (dbc5020)   │ 52956.79ms │
│ Average Time (4bd7c13) │  2396.02ms │
│ Average Time (dbc5020) │  2407.13ms │
│ Queries Faster         │          0 │
│ Queries Slower         │          0 │
│ Queries with No Change │         22 │
└────────────────────────┴────────────┘
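
For reading the "Change" column, the following sketch reproduces the labels from the two timings in each row. The 5% noise threshold is an assumption: it is consistent with every row in the tables above, but the benchmark output does not state it. `change_label` and `NOISE_THRESHOLD` are hypothetical names, not part of the benchmark tooling.

```rust
/// Relative difference below which a change is treated as noise.
/// Assumption: 5% fits the tables above but is not stated in the output.
const NOISE_THRESHOLD: f64 = 0.05;

/// Builds a label in the style of the "Change" column from two timings in ms.
fn change_label(baseline_ms: f64, pr_ms: f64) -> String {
    // Relative difference of the PR run with respect to the baseline run.
    let delta = (pr_ms - baseline_ms) / baseline_ms;
    if delta.abs() <= NOISE_THRESHOLD {
        "no change".to_string()
    } else if delta < 0.0 {
        // PR is faster: report how many times faster than the baseline.
        format!("+{:.2}x faster", baseline_ms / pr_ms)
    } else {
        // PR is slower: report how many times slower than the baseline.
        format!("{:.2}x slower", pr_ms / baseline_ms)
    }
}

fn main() {
    // Timings taken from the tpch_mem_sf1 table above (queries 1, 4 and 18).
    for (query, base, pr) in [(1, 291.81, 293.02), (4, 103.21, 79.51), (18, 512.08, 590.62)] {
        println!("Query {query}: {}", change_label(base, pr));
    }
}
```

With that threshold, the in-memory run has three faster queries and one slower one, while the larger `tpch_sf10` run shows no query outside the noise band.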

gruuya · 2024-04-04T12:16:44Z

datafusion/expr/src/utils.rs

@@ -42,7 +42,7 @@ use sqlparser::ast::{ExceptSelectItem, ExcludeSelectItem, WildcardAdditionalOpti

 ///  The value to which `COUNT(*)` is expanded to in
 ///  `COUNT(<constant>)` expressions
-pub const COUNT_STAR_EXPANSION: ScalarValue = ScalarValue::UInt8(Some(1));
+pub const COUNT_STAR_EXPANSION: ScalarValue = ScalarValue::Int64(Some(1));


This does however result in count(1) being auto-named to count(*) now.

❯ explain select count(1) from hits;
+---------------+---------------------------------------------------+
| plan_type     | plan                                              |
+---------------+---------------------------------------------------+
| logical_plan  | Aggregate: groupBy=[[]], aggr=[[COUNT(Int64(1))]] |
|               |   TableScan: hits projection=[]                   |
| physical_plan | ProjectionExec: expr=[99997497 as COUNT(*)]       |
|               |   PlaceholderRowExec                              |
|               |                                                   |
+---------------+---------------------------------------------------+
2 row(s) fetched.
Elapsed 0.263 seconds.

❯ select count(1) from hits;
+----------+
| COUNT(*) |
+----------+
| 99997497 |
+----------+
1 row(s) fetched.
Elapsed 0.015 seconds.

That might be fine as it reduces ambiguity.

Otherwise it could be resolved by e.g.:

diff --git a/datafusion/core/src/physical_optimizer/aggregate_statistics.rs b/datafusion/core/src/physical_optimizer/aggregate_statistics.rs
index df5422227..0e35b6a2b 100644
--- a/datafusion/core/src/physical_optimizer/aggregate_statistics.rs
+++ b/datafusion/core/src/physical_optimizer/aggregate_statistics.rs
@@ -160,6 +160,11 @@ fn take_optimizable_table_count(
                         ScalarValue::Int64(Some(num_rows as i64)),
                         COUNT_STAR_NAME,
                     ));
+                } else if lit_expr.value() == &ScalarValue::Int64(Some(1)) {
+                    return Some((
+                        ScalarValue::Int64(Some(num_rows as i64)),
+                        "COUNT(1)",
+                    ));
                 }
             }
         }
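
The renaming follows from how the statistics fast path picks its output name. Below is a minimal sketch, assuming the fast path returns a fixed name whenever the `COUNT` argument equals the sentinel and the exact row count is known. `Scalar` and `take_table_count` are hypothetical stand-ins; the real `take_optimizable_table_count` works on DataFusion's own plan and statistics types.

```rust
// Hypothetical stand-in for a typed scalar literal; NOT DataFusion's ScalarValue.
#[derive(Debug, Clone, PartialEq)]
enum Scalar {
    UInt8(Option<u8>),
    Int64(Option<i64>),
}

// Sentinel and output name as they look after this PR.
const COUNT_STAR_EXPANSION: Scalar = Scalar::Int64(Some(1));
const COUNT_STAR_NAME: &str = "COUNT(*)";

/// If the COUNT argument is the sentinel and the exact row count is known,
/// returns the precomputed count together with the output column name.
fn take_table_count(
    arg: &Scalar,
    exact_num_rows: Option<usize>,
) -> Option<(Scalar, &'static str)> {
    // Without exact statistics the fast path cannot be used.
    let num_rows = exact_num_rows?;
    if arg == &COUNT_STAR_EXPANSION {
        // The name is fixed, so COUNT(1) and COUNT(*) become indistinguishable here.
        Some((Scalar::Int64(Some(num_rows as i64)), COUNT_STAR_NAME))
    } else {
        None
    }
}

fn main() {
    // `count(1)` arrives as Int64(1), which now equals the sentinel.
    let from_count_one = take_table_count(&Scalar::Int64(Some(1)), Some(99_997_497));
    println!("{:?}", from_count_one);

    // A differently typed literal no longer matches and takes the regular path.
    let from_uint8 = take_table_count(&Scalar::UInt8(Some(1)), Some(99_997_497));
    println!("{:?}", from_uint8);
}
```

Because the sketch cannot tell `count(1)` apart from the expanded `count(*)`, keeping a distinct `COUNT(1)` name needs an extra branch like the one in the diff above.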

alamb

Thanks @gruuya -- this PR makes sense to me. I have definitely seen people do SELECT COUNT(1) rather than SELECT COUNT(*) so having them have the same performance seems good 👨‍🍳 👌

alamb · 2024-04-04T16:23:56Z

datafusion/expr/src/utils.rs

@@ -42,7 +42,7 @@ use sqlparser::ast::{ExceptSelectItem, ExcludeSelectItem, WildcardAdditionalOpti

 ///  The value to which `COUNT(*)` is expanded to in
 ///  `COUNT(<constant>)` expressions
-pub const COUNT_STAR_EXPANSION: ScalarValue = ScalarValue::UInt8(Some(1));
+pub const COUNT_STAR_EXPANSION: ScalarValue = ScalarValue::Int64(Some(1));


> This does however result in count(1) being auto-named to count(*) now.

FWIW, this is what happens on main with count(uint8) (the same thing)

❯ select count(arrow_cast(1, 'UInt8')) from (values (1));
+----------+
| COUNT(*) |
+----------+
| 1        |
+----------+
1 row(s) fetched.
Elapsed 0.002 seconds.

Though admittedly almost no one would actually type count(arrow_cast(1, 'UInt8')) so it is likely not a big deal.

> That might be fine as it reduces ambiguity.

I agree having the COUNT(*) in the plan actually helps as it makes it clearer when the fast path is being used.

Let's start with this and if someone else has a different opinion we can make another PR

Jefffrey

Makes sense to me, nice find 👍

alamb · 2024-04-05T11:13:48Z

Thanks again @gruuya

github-actions bot added the labels sql, logical-expr, optimizer, core, sqllogictest, substrait on Apr 4, 2024

gruuya force-pushed the count-star-int-64-sentinel branch from e0ecfc6 to dbc5020 on April 4, 2024 10:53

Change the sentinel value's type for COUNT(*) to Int64

c27e9c2

gruuya force-pushed the count-star-int-64-sentinel branch from dbc5020 to c27e9c2 on April 4, 2024 10:55

alamb mentioned this pull request on Apr 4, 2024 in: DataFusion weekly project plan (Andrew Lamb) - April 1, 2024 #9899 (closed)

alamb approved these changes Apr 4, 2024

alamb changed the title ~~Change the sentinel value's type for COUNT(*) to Int64~~ Optimize COUNT(1): Change the sentinel value's type for COUNT(*) to Int64 Apr 4, 2024

Jefffrey approved these changes Apr 4, 2024

alamb merged commit 1553a36 into apache:main Apr 5, 2024
26 checks passed

gruuya deleted the count-star-int-64-sentinel branch April 5, 2024 12:24
