-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Do not rename struct fields when coercing types in CASE
#14384
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
Conversation
CASE
Some(Struct(fields.into())) | ||
} | ||
_ => None, | ||
} | ||
} | ||
|
||
/// returns the result of coercing two fields to a common type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here is the fix. I suspect this doesn't matter in most cases as the result of a comparison is boolean
so the names of the intermediate fields don't matter.
However, in a CASE
(and union for that matter) the names do matter, so this PR preserves them
{ 'xxx': 'e' } -- column4: Struct with field xxx (no second field) | ||
); | ||
|
||
# Note field names are in different orders |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test documents some strange behavior that I don't really know if is a bug or not
DataFusion treats the field orders as important, and when coercing two structs, it does it in field order (not by name)
So in this case { 'foo': 'a', 'xxx': 'b' }
, and { 'xxx': 'c', 'foo': 'd' }
are coerced to compare the values a
/c
and b
/ d
(not the values a
/d
and b
/c
This PR doesn't change this behavior, but I felt that was an important behavior to document in tests, so I did so
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- While working on Fix regression list Type Coercion List with inner type struct which has large/view types #14385 I found that this behaior behavior (using struct position rather than field name) is inconsistent
Specifically the basic case comparison goes through "comparison_coercion" (fixed in this PR) which uses the list field order
But when coercing list elements, that goes through type_union_resolution
, (link) which matches fields by name (link)
@jayzhan211 is the eventual plan to unify thee code paths? Or maybe the case coercion should use type_union_resolution
🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We follow how DuckDB works, I think both field order and name should be the same
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be clear this PR doesn't change the existing behavior of comparison_coercion other than not renaming field names
@jayzhan211 as our resident coercion expert can you possibly review this bug? It is similar to #14385 |
/// returns the result of coercing two fields to a common type | ||
fn coerce_fields(common_type: DataType, lhs: &FieldRef, rhs: &FieldRef) -> FieldRef { | ||
let is_nullable = lhs.is_nullable() || rhs.is_nullable(); | ||
let name = lhs.name(); // pick the name from the left field |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the name is not the same, return error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that would be better -- can I do it in a follow on PR?
In my opinion this PR makes things better (doesn't lose field names) even though it is not perfect and I would like to fix the regression since 43
What I think we should do is switch comparison_coercion
to use type_union_coercion
eventually and then implement the correct coercion there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think type union resolution is the correct on for CASE
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I filed
to track the issue and plan to merge this one in when ready
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not renaming the fields looks like a good improvement to me. Thanks @alamb.
Thank you @andygrove What I plan to do is to file a follow on ticket to address @jayzhan211 's (very valid) concern that the coercion should error if the fields have different names and merge this PR in tomorrow |
* Fix field name during struct equality coercion * fix bug * Add more tests * Update tests per apache#14396
Which issue does this PR close?
Case
coercion of Structs loses field names #14383Rationale for this change
When coercing fields for a struct, we shouldn't rename the fields to
c0
,c1
, etc. See #14383What changes are included in this PR?
Preserve the Struct field names
Are these changes tested?
Yes, sqllogictests are added
Are there any user-facing changes?
coerced struct field names do not revever to
c0
, `c1, etc