Simplify windows builtin functions return type #8920
This PR is to unify output types to be defined in 1 single place and be transferred through the schema.
Thank you @comphead -- this change looks good to me.
Before we merge it, however, I think we should give @mustafasrepo / @ozankabak a chance to comment as I think they are familiar with this code
@@ -3906,3 +3906,69 @@ ProjectionExec: expr=[sn@0 as sn, ts@1 as ts, currency@2 as currency, amount@3 a
--BoundedWindowAggExec: wdw=[SUM(table_with_pk.amount) ORDER BY [table_with_pk.sn ASC NULLS LAST] ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW: Ok(Field { name: "SUM(table_with_pk.amount) ORDER BY [table_with_pk.sn ASC NULLS LAST] ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), frame: WindowFrame { units: Rows, start_bound: Preceding(UInt64(NULL)), end_bound: CurrentRow }], mode=[Sorted]
----SortExec: expr=[sn@0 ASC NULLS LAST]
------MemoryExec: partitions=1, partition_sizes=[1]
These tests all seem to pass for me on main as well (without the changes in this PR). Is that expected?
Yes, the tests check that the existing functionality has not regressed.
@@ -719,14 +720,16 @@ impl DefaultPhysicalPlanner {
}

let logical_input_schema = input.schema();
let physical_input_schema = input_exec.schema();
// Extend the schema to include window expression fields as builtin window functions derives its datatype from incoming schema
Shouldn't input.schema() reflect all the columns that the input produces? Or does the WindowAggExec create new columns "internally" by evaluating the window expressions?
The input schema is the schema from the previous plan node (no window expressions). AFAIK, the window expression columns are added separately.
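As a rough illustration of that relationship, here is a minimal sketch in plain Rust. The `Field` struct and `window_output_fields` function are hypothetical stand-ins, not DataFusion types; the only assumption taken from this thread is that the window node's output is the input's fields followed by one field per window expression.

```rust
// Hypothetical, simplified stand-in for a schema field; not DataFusion's type.
#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub name: String,
    pub data_type: String,
}

impl Field {
    pub fn new(name: &str, data_type: &str) -> Self {
        Field {
            name: name.to_string(),
            data_type: data_type.to_string(),
        }
    }
}

/// Output fields of a window node: every input field, in its original
/// order, followed by one new field per window expression.
pub fn window_output_fields(input: &[Field], window_exprs: &[Field]) -> Vec<Field> {
    let mut fields = input.to_vec();
    fields.extend_from_slice(window_exprs);
    fields
}
```

Under this model the input schema never contains the window columns; they only exist in the schema of the window node itself.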
@mustafasrepo PTAL
I think this PR adds two pieces of functionality, which are:
I think the first change is better. However, the second change is misleading. With the changes in this PR, given
Filed PR #8945 to retract the second change
let mut window_fields = logical_input_schema.fields().clone();
window_fields.extend_from_slice(&exprlist_to_fields(window_expr.iter(), input)?);
let extended_schema = &DFSchema::new_with_metadata(window_fields, HashMap::new())?;
let window_expr = window_expr
.iter()
.map(|e| {
create_window_expr(
e,
logical_input_schema,
&physical_input_schema,
extended_schema,
Why do you need to recompute this extended_schema, which looks the same as the window operator's schema? You can simply get the window's schema.
Filed #8955
@@ -1598,21 +1602,14 @@ pub fn create_window_expr_with_name(
pub fn create_window_expr(
e: &Expr,
logical_input_schema: &DFSchema,
This is misleading: it is now actually the window's schema, not the input schema.
@@ -1572,7 +1575,8 @@ pub fn create_window_expr_with_name(
create_physical_sort_expr(e, logical_input_schema, execution_props)
Hmm, this looks incorrect, as you use the window's schema as the input schema. Because the window's schema is the input schema plus the window functions' output, this change still works. But it is misleading for readers and a likely source of bugs.
This may be what @mustafasrepo has improved in #8920 (comment)
Exactly, I re-introduced the invariant of using only the input schema.
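To make the concern in this thread concrete, here is a small sketch in plain Rust. The `index_of` and `window_schema` helpers are hypothetical and model a schema as a list of column names; they are not DataFusion functions. The sketch assumes, as stated above, that the window's schema is the input schema plus the window functions' output.

```rust
/// Hypothetical helper: position of a column name in a list of column names.
pub fn index_of(schema: &[&str], name: &str) -> Option<usize> {
    schema.iter().position(|col| *col == name)
}

/// Builds the window node's column list: input columns first,
/// then the window function outputs appended at the end.
pub fn window_schema<'a>(input: &[&'a str], window_outputs: &[&'a str]) -> Vec<&'a str> {
    let mut cols = input.to_vec();
    cols.extend_from_slice(window_outputs);
    cols
}
```

Since the input columns keep their positions, looking them up in the window's schema yields the same index as in the input schema, which is why the substitution appears to work. The difference is that the window's schema also resolves the window output names, which do not exist in the input, so an expression that wrongly refers to one of them would no longer be rejected.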
Which issue does this PR close?
Closes #.
Rationale for this change
Before this PR, DataFusion derived the output datatypes for builtin window functions twice.
Initially set in BuiltInWindowFunction::return_type, the output datatype was then redefined in other places:
This PR unifies the output types so that they are defined in a single place and transferred through the schema.
Another benefit is that other systems that embed DataFusion as an engine can now use its datatypes by constructing the expected schema.
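The idea can be sketched as a toy model in plain Rust. Everything here (`DataType`, `WindowFn`, `SchemaField`, `plan_field`, `physical_type`) is a hypothetical stand-in rather than DataFusion's API, and the type mapping is illustrative only. The point is the flow: the return type is decided in one function, recorded in the schema, and later read back from the schema instead of being derived a second time.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum DataType {
    UInt64,
    Float64,
    Int32,
}

// Hypothetical subset of builtin window functions.
#[derive(Debug, Clone, Copy)]
pub enum WindowFn {
    RowNumber,
    PercentRank,
    Lag,
}

impl WindowFn {
    /// The single place where the output type is decided.
    pub fn return_type(self, arg_types: &[DataType]) -> Option<DataType> {
        match self {
            WindowFn::RowNumber => Some(DataType::UInt64),
            WindowFn::PercentRank => Some(DataType::Float64),
            // In this sketch, LAG yields the type of its first argument.
            WindowFn::Lag => arg_types.first().copied(),
        }
    }
}

/// A schema field in this sketch: (column name, data type).
pub type SchemaField = (String, DataType);

/// Logical planning: compute the type once and record it in the schema.
pub fn plan_field(name: &str, f: WindowFn, arg_types: &[DataType]) -> Option<SchemaField> {
    f.return_type(arg_types).map(|dt| (name.to_string(), dt))
}

/// Physical planning: read the type back from the schema, with no re-derivation.
pub fn physical_type(schema: &[SchemaField], name: &str) -> Option<DataType> {
    schema
        .iter()
        .find(|(n, _)| n.as_str() == name)
        .map(|(_, dt)| *dt)
}
```

With this shape, an embedding system that constructs the expected schema itself would get the same datatypes as the planner, because both read from the same field.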
What changes are included in this PR?
The datatype definition is unified and covered with new tests, along with some other minor optimizations.
Are these changes tested?
Yes
Are there any user-facing changes?
No