Upgrade to datafusion 38 #691
Upstream is continuing its migration to UDFs. Ref apache/datafusion#10098 Ref apache/datafusion#10372
…ters_pushdown Deprecated function removed in apache/datafusion#9923
These relied on upstream `BuiltinScalarFunction`, which is now removed. Ref apache/datafusion#10098
`null_count` was fixed upstream. Ref apache/datafusion#10260
DFField was removed upstream. Ref: apache/datafusion#9595
f311d66 to abe09a2
        pub fn column_name(&self, plan: PyLogicalPlan) -> PyResult<String> {
            self._column_name(&plan.plan()).map_err(py_runtime_err)
        }
    }

    impl PyExpr {
@jdye64 you may want to review this PR since it removes code that I believe you originally added
@jdye64 - I had removed the method because it relied on `DFField`, which was removed in `datafusion`. The last commit attempts to re-implement the method using arrow's `Field`. I'd still much appreciate any feedback / context!
also cc @charlesbluca
It looks like Dask SQL is using a pinned version of this repo from more than six months ago, so we likely won't get a review from the team right away. The new functionality based on `Field` looks good to me, so I will go ahead and merge this PR.
Yeah, this is fine. Honestly, we need to come up with a better way to get the column name anyway and, as you mentioned, we are using a pinned older version for now.
    "a": [3.0, 0.0, 2.0, 1.0, 1.0, 3.0, 2.0],
    "b": [3.0, 0.0, 5.0, 1.0, 4.0, 6.0, 5.0],
    "c": [3.0, 0.0, 7.0, 1.7320508075688772, 5.0, 8.0, 8.0],
Why are these changes needed?
`null_count` was fixed upstream in apache/datafusion#10260
The underlying data being described:
>>> print(df)
DataFrame()
+---+---+---+
| a | b | c |
+---+---+---+
| 1 | 4 | 8 |
| 2 | 5 | 5 |
| 3 | 6 | 8 |
+---+---+---+
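For anyone checking the updated expectations by hand, here is a minimal sketch in plain Python (no DataFusion involved) that recomputes the statistics from the table above. It assumes the seven values per column are, in order, count, null_count, mean, std (sample standard deviation), min, max, and median; that ordering is an assumption that happens to be consistent with the numbers in the test, and the `describe_column` helper is hypothetical, not part of this repo.

```python
import statistics


def describe_column(values):
    """Hypothetical helper: summary statistics for one column.

    Assumed row order: count, null_count, mean, std, min, max, median.
    """
    non_null = [v for v in values if v is not None]
    return [
        float(len(non_null)),                # count of non-null values
        float(len(values) - len(non_null)),  # null_count
        float(statistics.mean(non_null)),    # mean
        float(statistics.stdev(non_null)),   # sample standard deviation
        float(min(non_null)),                # min
        float(max(non_null)),                # max
        float(statistics.median(non_null)),  # median
    ]


# The same data as the printed DataFrame above; it contains no nulls.
data = {"a": [1, 2, 3], "b": [4, 5, 6], "c": [8, 5, 8]}
summary = {name: describe_column(col) for name, col in data.items()}

for name, stats in summary.items():
    print(name, stats)
```

Since none of the columns contain nulls, the second entry of each list is `0.0` under this reading, which lines up with the updated expected values.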
The previous implementation relied on `DFField` which was removed upstream. Ref: apache/datafusion#9595
LGTM. Thank you @Michael-J-Ward. It is great to see this project keeping up with DataFusion core.
Which issue does this PR close?
Closes #690.
Are there any user-facing changes?
- `DFField` and related methods were removed
- `PyScalarFunction` and `PyBuiltinScalarFunction` were removed
- `null_count` was fixed upstream so the behavior has changed