-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
Optimize COUNT(1)
: Change the sentinel value's type for COUNT(*) to Int64
#9944
Conversation
e0ecfc6
to
dbc5020
Compare
Might as well try whether /benchmark works as intended now. |
dbc5020
to
c27e9c2
Compare
Benchmark resultsBenchmarks comparing 4bd7c13 (main) and dbc5020 (PR)
|
@@ -42,7 +42,7 @@ use sqlparser::ast::{ExceptSelectItem, ExcludeSelectItem, WildcardAdditionalOpti | |||
|
|||
/// The value to which `COUNT(*)` is expanded to in | |||
/// `COUNT(<constant>)` expressions | |||
pub const COUNT_STAR_EXPANSION: ScalarValue = ScalarValue::UInt8(Some(1)); | |||
pub const COUNT_STAR_EXPANSION: ScalarValue = ScalarValue::Int64(Some(1)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does however result in count(1)
being auto-named to count(*)
now.
❯ explain select count(1) from hits;
+---------------+---------------------------------------------------+
| plan_type | plan |
+---------------+---------------------------------------------------+
| logical_plan | Aggregate: groupBy=[[]], aggr=[[COUNT(Int64(1))]] |
| | TableScan: hits projection=[] |
| physical_plan | ProjectionExec: expr=[99997497 as COUNT(*)] |
| | PlaceholderRowExec |
| | |
+---------------+---------------------------------------------------+
2 row(s) fetched.
Elapsed 0.263 seconds.
❯ select count(1) from hits;
+----------+
| COUNT(*) |
+----------+
| 99997497 |
+----------+
1 row(s) fetched.
Elapsed 0.015 seconds.
That might be fine as it reduces ambiguity.
Otherwise it could be resolved by e.g.:
diff --git a/datafusion/core/src/physical_optimizer/aggregate_statistics.rs b/datafusion/core/src/physical_optimizer/aggregate_statistics.rs
index df5422227..0e35b6a2b 100644
--- a/datafusion/core/src/physical_optimizer/aggregate_statistics.rs
+++ b/datafusion/core/src/physical_optimizer/aggregate_statistics.rs
@@ -160,6 +160,11 @@ fn take_optimizable_table_count(
ScalarValue::Int64(Some(num_rows as i64)),
COUNT_STAR_NAME,
));
+ } else if lit_expr.value() == &ScalarValue::Int64(Some(1)) {
+ return Some((
+ ScalarValue::Int64(Some(num_rows as i64)),
+ "COUNT(1)",
+ ));
}
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does however result in count(1) being auto-named to count(*) now.
FWIW, this is what happens on main with count(uint8)
(the same thing)
❯ select count(arrow_cast(1, 'UInt8')) from (values (1));
+----------+
| COUNT(*) |
+----------+
| 1 |
+----------+
1 row(s) fetched.
Elapsed 0.002 seconds.
Though admittedly almost no one would actually type count(arrow_cast(1, 'UInt8'))
so it is likely not a big deal.
That might be fine as it reduces ambiguity.
I agree having the COUNT(*)
in the plan actually helps as it makes it clearer when the fast path is being used.
Let's start with this and if someone else has a different opinion we can make another PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @gruuya -- this PR makes senes to me. I have definitely seen people do SELECT COUNT(1)
rather than SELECT COUNT(*)
so having them have the same performance seems good 👨🍳 👌
@@ -42,7 +42,7 @@ use sqlparser::ast::{ExceptSelectItem, ExcludeSelectItem, WildcardAdditionalOpti | |||
|
|||
/// The value to which `COUNT(*)` is expanded to in | |||
/// `COUNT(<constant>)` expressions | |||
pub const COUNT_STAR_EXPANSION: ScalarValue = ScalarValue::UInt8(Some(1)); | |||
pub const COUNT_STAR_EXPANSION: ScalarValue = ScalarValue::Int64(Some(1)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does however result in count(1) being auto-named to count(*) now.
FWIW, this is what happens on main with count(uint8)
(the same thing)
❯ select count(arrow_cast(1, 'UInt8')) from (values (1));
+----------+
| COUNT(*) |
+----------+
| 1 |
+----------+
1 row(s) fetched.
Elapsed 0.002 seconds.
Though admittedly almost no one would actually type count(arrow_cast(1, 'UInt8'))
so it is likely not a big deal.
That might be fine as it reduces ambiguity.
I agree having the COUNT(*)
in the plan actually helps as it makes it clearer when the fast path is being used.
Let's start with this and if someone else has a different opinion we can make another PR
COUNT(1)
: Change the sentinel value's type for COUNT(*) to Int64
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense to me, nice find 👍
Thanks again @gruuya |
Which issue does this PR close?
Closes #9943.
Rationale for this change
Make
COUNT(1)
equivalent toCOUNT(*)
.What changes are included in this PR?
Change the type of
COUNT_STAR_EXPANSION
toScalarValue::Int64
.Are these changes tested?
Yes, with revised existing tests and a new one that tests the optimization for
count(1)
case.Are there any user-facing changes?