-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Move min and max to user defined aggregate function, remove AggregateFunction
/ AggregateFunctionDefinition::BuiltIn
#11013
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
Conversation
I do have something that's starting to look reasonable, but some tests on the optimizer now are failing for some reasons I can't understand
|
I guess you skip the aggregate statistic optimization for min/max datafusion/datafusion/core/src/physical_optimizer/aggregate_statistics.rs Lines 177 to 224 in 18042fd
You might need to check if the |
I fixed this but now I have a test that doesn't pass on the optimizer (there are two actually)
That suggests that the optimizer cannot use the existing aliases / doesn't understand the existing aliases that provide DISTINCT test.b . Looking, any tip would be highly appreciated |
I think we should add distinct for MIN/MAX so we can get the But I think there is no difference between MIN and Distinct Min, maybe we could remove distinct for MIN/MAX beforehand? Introduce EliminateDistinct optimize rule for MIN/MAX. |
Is this a part of the optimizer i.e. https://github.com/edmondop/arrow-datafusion/blob/main/datafusion/optimizer/src/replace_distinct_aggregate.rs ? Thank your for your help btw |
I don't think so, Distinct/Distinct On is different from distinct in the function. |
@jayzhan211 I have started experimenting with an optimizer rule, but removing the distinct result in such an error:
Do I need to change also the equivalence rules? |
You can take |
Thanks. I guess I wasn't clear in my comment here #11013 (comment) . How should that test failure be addressed? It seems that min/max udaf uses other aliases and is not reusing the intermediate results already available |
If we eliminate distinct of min/max prior to |
Wouldn't eliminating it require the optimizer rule? Or do you suggest I update the test case? Or the expected value? |
Yes, I suggest we update the test like #[test]
fn one_distinct_and_two_common() -> Result<()> {
let table_scan = test_table_scan()?;
let plan = LogicalPlanBuilder::from(table_scan)
.aggregate(
vec![col("a")],
vec![sum(col("c")), count_distinct(col("b")), max(col("b"))],
)?
.build()?;
// Should work
let expected = "Projection: test.a, sum(alias2) AS sum(test.c), COUNT(alias1) AS COUNT(DISTINCT test.b), MAX(alias3) AS MAX(test.b) [a:UInt32, sum(test.c):UInt64;N, COUNT(DISTINCT test.b):Int64;N, MAX(test.b):UInt32;N]\n Aggregate: groupBy=[[test.a]], aggr=[[sum(alias2), COUNT(alias1), MAX(alias3)]] [a:UInt32, sum(alias2):UInt64;N, COUNT(alias1):Int64;N, MAX(alias3):UInt32;N]\n Aggregate: groupBy=[[test.a, test.b AS alias1]], aggr=[[sum(test.c) AS alias2, MAX(test.b) AS alias3]] [a:UInt32, alias1:UInt32, alias2:UInt64;N, alias3:UInt32;N]\n TableScan: test [a:UInt32, b:UInt32, c:UInt32]";
assert_optimized_plan_equal(plan, expected)
} |
There seems to be a column added to the Aggregate node in the logical plan, can that affect performance and/or memory footprint? This was the reason why I didn't update the test in the first place This is a subset of the new plan
while this is the subset from the previous plan
there is an alias3:UInt64 that gets added |
Remove the Min/Max matching in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you so much @edmondop -- I took a look at this PR and I think in general it is quite close.
It needs:
- to remove the old min/max implementation in https://github.com/apache/datafusion/blob/5bb6b356277ea1c6f1d7af64e2d66f005d7e1ed4/datafusion/physical-expr/src/aggregate/min_max.rs
- resolve some merge conflicts
There is also a follow on issue / PR I would like to make regarding the optimizer check
Given this PR has hung out for a while and has some merge conflicts now I am going to try and help polish it up
I think as long as you can explain me how to resolve the current test failure I should be fine. Agree using names for min and max unwrapping is not very robust |
@jayzhan211 fyi all the tests in scalar subquery failed with the stubs, I restored import that used the real implementation |
wait, we should use stubs instead of the real function, otherwise we import the |
You use the name "min" in Max, and "max" in Min for stubs |
It is available, I hadn't have to do anything / update the Cargo |
I see. We don't need stubs. |
Do you expect any additional changes ? I fixed the stub function name anyways |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
self.nullable, | ||
)) | ||
fn name(&self) -> &str { | ||
"MIN" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Follow up PR is to lowercase the name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I logged this #11779 I am not sure we should lowercase this or make the other UDF uppercase
|
||
impl std::fmt::Debug for Max { | ||
fn fmt(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result { | ||
f.debug_struct("Min") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be "Max"
AggregateFunction
AggregateFunction
AggregateFunction
/ AggregateFunctionDefinition::BuiltIn
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @edmondop and @jayzhan211 -- this is epic work and I think the codebase is that much better for it 🙏

After this PR is merged i think we can also remove the AggregateFunctionDefition
enum that now only has a single variant as a follow on cleanup
println!( | ||
"Copying {} to {}", | ||
prost.clone().display(), | ||
proto_dir.join("src/generated/prost.rs").display() | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is fine to print out some status reporting while regenerating protos 👍
@@ -466,51 +464,6 @@ message InListNode { | |||
bool negated = 3; | |||
} | |||
|
|||
enum AggregateFunction { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -477,30 +477,10 @@ impl AsExecutionPlan for protobuf::PhysicalPlanNode { | |||
ExprType::AggregateExpr(agg_node) => { | |||
let input_phy_expr: Vec<Arc<dyn PhysicalExpr>> = agg_node.expr.iter() | |||
.map(|e| parse_physical_expr(e, registry, &physical_schema, extension_codec)).collect::<Result<Vec<_>>>()?; | |||
let ordering_req: Vec<PhysicalSortExpr> = agg_node.ordering_req.iter() | |||
let _ordering_req: Vec<PhysicalSortExpr> = agg_node.ordering_req.iter() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is an interesting change -- does it mean ordering is not carried into the udf?
Or maybe it is redundant and now is entirely determined by the udf
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ordering are not supported for udaf yet. I had leave a TODO comment below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO:
|
@jayzhan211 should we file tickets to track this work (I think @edmondop has a PR for the lower case name). The I am happy to file the tickets if you would like |
Sure, thanks |
#11805 <-- comment PR |
Which issue does this PR close?
Closes #10943 .
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?