-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Generalized constant folding #1128
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
Conversation
…ed constant folding optimizer to allow for more general folding
@pjmore is this the last missing piece before you can resume the tokomak work? |
@houqp Yep, that is everything! I'm going to start on that now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very interesting work! Thanks a lot!
} | ||
|
||
///Evaluates an expression if it only contains literal expressions. If a non-literal expression or non-immutable function is found then it returns an error. | ||
pub fn evalute_const_expr(expr: &Expr) -> Result<ScalarValue> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you plan to use this method in a PR soon? Otherwise don't expose this function if it is never used (YAGNI).
Same would apply to expr_is_const()
that is only used here 😃.
If you do plan to use it in a follow up PR, I would say that it would be very important to test that expr_is_const()
gives consistent results with rewrite_const_expr()
(If I understand the meaning of the return value correctly)
} | ||
|
||
///Evaluates an expression. Note that this will return incorrect results if the expression contains function calls which change return values call to call, such as Now(). | ||
pub fn evaluate_const_expr_unchecked(expr: &Expr) -> Result<ScalarValue> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand this is somewhat covered by the tests in constant_folding.rs
, but if it is exposed publicly it should have its own set of tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: also it is surprising that this method uses fully qualified syntax (like std::sync::Arc
) almost everywhere 😃
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the review in 2 parts 😅.
let dummy_df_schema = DFSchema::empty(); | ||
let dummy_input_schema = arrow::datatypes::Schema::new(vec![Field::new( | ||
DUMMY_COL_NAME, | ||
DataType::Float64, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we maybe use the Null
type here to make it clearer that it doesn't matter?
DataType::Float64, | |
DataType::Null, |
expr | ||
} | ||
|
||
fn rewrite_const_expr(&mut self, expr: &mut Expr) -> bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we maybe comment on the meaning of the returned boolean value? If I understand correctly, it indicates whether the expression can be evaluated as a constant expression, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that is what it is, I'll add some documentation.
false | ||
} | ||
(false, true) => { | ||
self.const_fold_list_eager(list); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case we run rewrite_const_expr()
twice on each expression, is that expected?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well rereading the code it's expected but not intended haha.
} | ||
|
||
///Evaluates an expression. Note that this will return incorrect results if the expression contains function calls which change return values call to call, such as Now(). | ||
pub fn evaluate_const_expr_unchecked(expr: &Expr) -> Result<ScalarValue> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: also it is surprising that this method uses fully qualified syntax (like std::sync::Arc
) almost everywhere 😃
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yet another question, sorry for that 😃. In case you are wondering, I am looking into to this PR thoroughly because I need something somewhat similar in #1141 to execute a partial filtering expression on the partition values: https://github.com/apache/arrow-datafusion/blob/ecb158a9fb86c52e97edb01d345d4edcb84dcefe/datafusion/src/datasource/listing.rs#L346-L428
Expr::ScalarFunction { | ||
fun: BuiltinScalarFunction::Now, | ||
.. | ||
} => Expr::Literal(ScalarValue::TimestampNanosecond(Some( | ||
self.execution_props | ||
.query_execution_start_time | ||
.timestamp_nanos(), | ||
))), | ||
Expr::ScalarFunction { | ||
fun: BuiltinScalarFunction::ToTimestamp, | ||
args, | ||
} => { | ||
if !args.is_empty() { | ||
match &args[0] { | ||
Expr::Literal(ScalarValue::Utf8(Some(val))) => { | ||
match string_to_timestamp_nanos(val) { | ||
Ok(timestamp) => Expr::Literal( | ||
ScalarValue::TimestampNanosecond(Some(timestamp)), | ||
), | ||
_ => Expr::ScalarFunction { | ||
fun: BuiltinScalarFunction::ToTimestamp, | ||
args, | ||
}, | ||
} | ||
*expr = Expr::Literal(ScalarValue::TimestampNanosecond(Some( | ||
self.execution_props | ||
.query_execution_start_time | ||
.timestamp_nanos(), | ||
))); | ||
true | ||
} | ||
Expr::ScalarFunction { fun, args } => { | ||
if args.is_empty() { | ||
false | ||
} else { | ||
let volatility = fun.volatility(); | ||
match volatility { | ||
Volatility::Immutable => self.const_fold_list(args), | ||
_ => { | ||
self.const_fold_list_eager(args); | ||
false | ||
} | ||
_ => Expr::ScalarFunction { | ||
fun: BuiltinScalarFunction::ToTimestamp, | ||
args, | ||
}, | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please explain why we isolate the case of Now
here instead of considering Volatility::Stable
ScalarFunction
s as foldable? Stable functions are supposed to be constant in the context of the query, isn't that enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could have sworn that now() was implemented by rewriting the expression tree but it looks like that is wrong. You're right stable functions should be foldable.
I will review this later today -- thank you @pjmore ! |
&dummy_input_schema, | ||
&ctx_state, | ||
)?; | ||
let col = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you could use new_null_array
here from arrow:
https://docs.rs/arrow/6.0.0/arrow/array/fn.new_null_array.html
} | ||
///This attempts to simplify expressions of the form col(Boolean) = Boolean and col(Boolean) != Boolean | ||
/// e.g. col(Boolean) = Some(true) -> col(Boolean). It also handles == and != between two boolean literals as | ||
/// the binary operator physical expression currently doesn't handle them. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to move this to the binary operator then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if we take the approach on #1145, we can remove this code that handles constants
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like that your approach uses the expression rewriting framework that's already in place
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @pjmore -- this is pretty awesome.
One thing I was thinking when looking at this PR, was that the partial evaluation / rewrites and and constant folding were somewhat intertwined.
What would you think of extracting the partial evaluator into a function that could be invoked by the constant folding pass rather than something that is part of the constant folding pass?
I sketched out what this might look like in this PR here #1145 (I am happy to finish that up if you think it is valuable)
I was thinking then we could encode rules like
false OR <expr> --> <expr>
false AND <expr> --> false
as part of the constant folder and have the constant folder invoke the evaluator.
What do you think?
} | ||
} | ||
|
||
///Evaluates an expression if it only contains literal expressions. If a non-literal expression or non-immutable function is found then it returns an error. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest the public API should be something like:
partially_evaluate_expr(expr: &Expr)
or
Expr::partially_evaluate(self) -> Self
Yeah I think that's is a better approach. Those special cases with Expr::Not and Expr::BinaryExpr are not pretty in my implementation. Given that the vast majority of this PR can be deleted if we go that route I'm going to close this. |
Cool -- thank you @pjmore for pushing this forward. Would you like to keep working at it, or shall I polish up #1145 to get that merged? This feature is going to really push DataFusion forward I think |
Which issue does this PR close?
Closes #1070.
What changes are included in this PR?
Added an evaluate function which evaluates literal scalar expressions using the existing physical expression implementation. Reworked ConstantFolding optimizer using the evaluate function to perform more generalized constant folding. The optimizer now traverses the expression tree depth first returning whether the expression is made up of literal expressions. When an expression that has multiple children is encountered all children expressions are checked to see if they are literals, if so then it simply returns that the expression is literal. If one of the child expressions is not literal then all literal child expressions are evaluated and then non literal is returned. For expressions that can never be evaluated in a scalar manner, such as aggregate or window expressions, the children are traversed but the expression itself will never be evaluated.
Are there any user-facing changes?
The added evaluate function is currently public to allow usage of the function from other crates in the workspace.