Skip to content

Upgrade to datafusion 38 #691

New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Merged
merged 11 commits into from
May 14, 2024
Merged

Conversation

Michael-J-Ward
Copy link
Contributor

Which issue does this PR close?

Closes #690.

Are there any user-facing changes?

  • DFField and related methods were removed
  • PyScalarFunction and PyBuiltinScalarFunction were removed
  • null_count was fixed upstream so the behavior has changed


pub fn column_name(&self, plan: PyLogicalPlan) -> PyResult<String> {
self._column_name(&plan.plan()).map_err(py_runtime_err)
}
}

impl PyExpr {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jdye64 you may want to review this PR since it removes code that I believe you originally added

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jdye64 - I had removed the method because it relied on DFField which was removed in datafusion.

The last commit attempts to re-implement the method using arrow's Field.

I'd still much appreciate any feedback / context!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also cc @charlesbluca

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like Dask SQL is using a pinned version of this repo from more than six months ago, so we likely won't get a review from the team right away. The new functionality based on Field looks good to me, so I will go ahead and merge this PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this is fine. Honestly we need to come up with a better way to get the column name anyway and as you mentioned are using a pinned older version for now anyway.

Comment on lines +733 to +735
"a": [3.0, 0.0, 2.0, 1.0, 1.0, 3.0, 2.0],
"b": [3.0, 0.0, 5.0, 1.0, 4.0, 6.0, 5.0],
"c": [3.0, 0.0, 7.0, 1.7320508075688772, 5.0, 8.0, 8.0],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are these changes needed?

Copy link
Contributor Author

@Michael-J-Ward Michael-J-Ward May 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

null_count was fixed upstream in apache/datafusion#10260

The underlying data being described:

>>> print(df)
DataFrame()
+---+---+---+
| a | b | c |
+---+---+---+
| 1 | 4 | 8 |
| 2 | 5 | 5 |
| 3 | 6 | 8 |
+---+---+---+

The previous implementation relied on `DFField` which was removed upstream.

Ref: apache/datafusion#9595
Copy link
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thank you @Michael-J-Ward. It is great to see this project keeping up with DataFusion core.

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Tracking Upgrade to Datafusion 38
3 participants