RFC: Demonstrate what a function package might look like -- encoding expressions #8046

Closed
wants to merge 24 commits into from

@alamb alamb commented Nov 3, 2023

Which issue does this PR close?

Builds on #8039

Demonstrates what #8045 might look like

Rationale for this change

This PR demonstrates what a function package API might look like by removing the encoding expressions encode/decode from the BuiltinScalarFunction enum and adding them to a separate crate (datafusion-functions).

What changes are included in this PR?

  1. A new FunctionImplementation trait, integrated into ScalarUDF, to make it easier to write ScalarUDFs.
  2. A new datafusion-functions crate that contains the implementation of encode and decode.
  3. Automatic registration of these functions as part of SessionState::new(), similar to the automatically registered ListingTables.
  4. TODO: optional enabling of functions based on a feature flag.
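
A minimal sketch of how item 1 could fit together, assuming simplified stand-ins for DataFusion's types. Only the names `FunctionImplementation`, `ScalarUDF`, and `from_impl` come from this PR; `DataType`, `HexEncode`, the `invoke` method, and the byte-vector argument type are hypothetical. The idea shown is that a trait object can be adapted into the closure-based fields a `ScalarUDF` already carries, so existing callers are unaffected.

```rust
use std::sync::Arc;

/// Simplified stand-in for Arrow's `DataType` (assumption: two variants only).
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Utf8,
    Binary,
}

/// Hypothetical shape of the convenience trait: the name, the return type,
/// and the evaluation logic live side by side.
pub trait FunctionImplementation {
    fn name(&self) -> &str;
    fn return_type(&self, arg_types: &[DataType]) -> Result<DataType, String>;
    fn invoke(&self, args: &[Vec<u8>]) -> Result<Vec<u8>, String>;
}

type ReturnTypeFn = Arc<dyn Fn(&[DataType]) -> Result<Arc<DataType>, String> + Send + Sync>;
type ImplFn = Arc<dyn Fn(&[Vec<u8>]) -> Result<Vec<u8>, String> + Send + Sync>;

/// Stand-in for `ScalarUDF` that keeps closure-based fields.
pub struct ScalarUDF {
    name: String,
    return_type: ReturnTypeFn,
    fun: ImplFn,
}

impl ScalarUDF {
    /// Wraps a trait object in the closure-based representation.
    pub fn from_impl(imp: Arc<dyn FunctionImplementation + Send + Sync>) -> Self {
        let for_types = Arc::clone(&imp);
        let for_eval = Arc::clone(&imp);
        Self {
            name: imp.name().to_string(),
            return_type: Arc::new(move |args: &[DataType]| {
                for_types.return_type(args).map(Arc::new)
            }),
            fun: Arc::new(move |args: &[Vec<u8>]| for_eval.invoke(args)),
        }
    }

    pub fn name(&self) -> &str {
        &self.name
    }

    pub fn return_type(&self, args: &[DataType]) -> Result<DataType, String> {
        let res = (self.return_type)(args)?;
        Ok(res.as_ref().clone())
    }

    pub fn invoke(&self, args: &[Vec<u8>]) -> Result<Vec<u8>, String> {
        (self.fun)(args)
    }
}

/// Toy function used only for illustration: hex-encodes its first argument.
pub struct HexEncode {}

impl FunctionImplementation for HexEncode {
    fn name(&self) -> &str {
        "hex_encode"
    }

    fn return_type(&self, arg_types: &[DataType]) -> Result<DataType, String> {
        match arg_types.first() {
            Some(DataType::Utf8) | Some(DataType::Binary) => Ok(DataType::Utf8),
            None => Err("hex_encode expects one argument".to_string()),
        }
    }

    fn invoke(&self, args: &[Vec<u8>]) -> Result<Vec<u8>, String> {
        let input = args.first().ok_or("hex_encode expects one argument")?;
        let hex: String = input.iter().map(|b| format!("{:02x}", b)).collect();
        Ok(hex.into_bytes())
    }
}
```

With this shape, `ScalarUDF::from_impl(Arc::new(HexEncode {}))` yields a value that can be stored and called like any other UDF in the sketch.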

Open Questions:

  1. To support the expr_fns encode and decode, I think we will need an Expr::ScalarFunction call or something similar that can take a function by name rather than a fully resolved function.
  2. Extract the registration functions from SessionContext into their own trait / consolidate the function registry code rather than passing
    around a set of HashMaps, and add a way to actually modify them.

Are these changes tested?

Are there any user-facing changes?

@github-actions bot added the logical-expr (Logical plan and expressions), physical-expr (Changes to the physical-expr crates), optimizer (Optimizer rules), and core (Core DataFusion crate) labels Nov 3, 2023
@alamb alamb force-pushed the alamb/extract_encoding_expressions branch from b3e25be to c441a0d Compare November 3, 2023 21:42
@@ -710,30 +704,6 @@ impl BuiltinScalarFunction {
    BuiltinScalarFunction::Digest => {
        utf8_or_binary_to_binary_type(&input_expr_types[0], "digest")
    }
    BuiltinScalarFunction::Encode => Ok(match input_expr_types[0] {
@alamb alamb Nov 3, 2023

This metadata about the functions has moved into the functions/encoding.rs module, alongside its implementation.

}
}

/// Convenience trait for implementing ScalarUDF. See [`ScalarUDF::from_impl()`]
@alamb (Contributor Author) commented:

This echoes the trait that @2010YOUY01 proposed in #7752, but does so in a backwards-compatible way (it makes a ScalarUDF out of the trait, to retain backwards compatibility).


pub(super) struct EncodeFunc {}

static ENCODE_SIGNATURE: OnceLock<Signature> = OnceLock::new();
@alamb (Contributor Author) commented:

Here is what encode and decode look like using the ScalarUDF API -- I think they are much clearer when all this type information is in one place (though I still kept it separate from the implementation to show the implementation did not change at all)
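
The `OnceLock<Signature>` static in the diff suggests the signature is built lazily, once, and then shared. A minimal sketch of that pattern, assuming a simplified `Signature` stand-in (the field and the `encode_signature` accessor are hypothetical, not taken from the PR):

```rust
use std::sync::OnceLock;

/// Simplified stand-in for DataFusion's `Signature` (assumption).
#[derive(Debug, PartialEq)]
pub struct Signature {
    pub accepted_first_arg_types: Vec<&'static str>,
}

static ENCODE_SIGNATURE: OnceLock<Signature> = OnceLock::new();

/// Builds the signature on first use and returns the same instance afterwards.
pub fn encode_signature() -> &'static Signature {
    ENCODE_SIGNATURE.get_or_init(|| Signature {
        accepted_first_arg_types: vec!["Utf8", "Binary"],
    })
}
```

Every call to `encode_signature()` returns a reference to the same value, so the function struct itself can stay empty.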

use std::collections::HashMap;
use std::sync::Arc;

/// Registers the `encode` and `decode` functions with the function registry
@alamb (Contributor Author) commented:

Here is the conditional registration of these functions based on feature flag -- there are probably nicer ways to do this but I don't think it is any worse than the current solution.
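
One way such feature-gated registration could be written is sketched below. The feature name `encoding_expressions`, the `register_encoding` helper, and the `ScalarUDF` stand-in are assumptions for illustration; the PR's actual code is not shown in full here.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Simplified stand-in for DataFusion's `ScalarUDF` (assumption).
pub struct ScalarUDF {
    pub name: String,
}

/// Compiled only when the (assumed) `encoding_expressions` feature is enabled.
#[cfg(feature = "encoding_expressions")]
fn register_encoding(registry: &mut HashMap<String, Arc<ScalarUDF>>) {
    for name in ["encode", "decode"] {
        let udf = ScalarUDF { name: name.to_string() };
        registry.insert(name.to_string(), Arc::new(udf));
    }
}

/// No-op fallback, so callers do not need their own `cfg` checks.
#[cfg(not(feature = "encoding_expressions"))]
fn register_encoding(_registry: &mut HashMap<String, Arc<ScalarUDF>>) {}

/// Registers all "built in" functions that were compiled in.
pub fn register_all(registry: &mut HashMap<String, Arc<ScalarUDF>>) {
    register_encoding(registry);
}
```

The caller always invokes `register_all`; whether `encode` and `decode` end up in the registry is decided at compile time.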


/// Registers all "built in" functions from this crate with the provided registry
pub fn register_all(registry: &mut HashMap<String, Arc<ScalarUDF>>) {
encoding::register(registry);
@alamb (Contributor Author) commented:

I envision extending this list with other packages over time.

alamb commented Nov 4, 2023

@2010YOUY01 and @viirya I wonder if you have any thoughts on this approach / proposal?

@2010YOUY01 2010YOUY01 left a comment

Thank you, this looks great. I have several questions/suggestions:

> 1. to support the expr_fns encode and decode, I think we will need an Expr::ScalarFunction call or something that can take a function by name rather than fully resolved function

Constructing an Expr for a built-in function is currently stateless (it does not require a context), so it is tricky to stay backwards compatible with the Expr API. The best solution I can think of is to also support initializing a UDF Expr with only a name string and resolving it during logical plan optimization.
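
The name-then-resolve idea can be sketched as follows. This is an illustration under stated assumptions, not DataFusion's API: the `Expr` variants `UnresolvedFunction` and `ResolvedFunction`, the `resolve` pass, and the `ScalarUDF` stand-in are all hypothetical.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Simplified stand-in for DataFusion's `ScalarUDF` (assumption).
pub struct ScalarUDF {
    pub name: String,
}

pub type Registry = HashMap<String, Arc<ScalarUDF>>;

/// Toy expression tree with a name-only function variant.
pub enum Expr {
    Literal(String),
    /// Built without any context: only the function name is known.
    UnresolvedFunction { name: String, args: Vec<Expr> },
    /// Bound to a concrete implementation by the resolution pass.
    ResolvedFunction { fun: Arc<ScalarUDF>, args: Vec<Expr> },
}

/// Stateless constructor, in the spirit of the existing `expr_fn` helpers.
pub fn encode(args: Vec<Expr>) -> Expr {
    Expr::UnresolvedFunction { name: "encode".to_string(), args }
}

fn resolve_all(args: Vec<Expr>, registry: &Registry) -> Result<Vec<Expr>, String> {
    args.into_iter().map(|arg| resolve(arg, registry)).collect()
}

/// Replaces every name-only call with a call bound to a registered function.
pub fn resolve(expr: Expr, registry: &Registry) -> Result<Expr, String> {
    match expr {
        lit @ Expr::Literal(_) => Ok(lit),
        Expr::UnresolvedFunction { name, args } => {
            let fun = registry
                .get(&name)
                .cloned()
                .ok_or_else(|| format!("unknown function '{}'", name))?;
            Ok(Expr::ResolvedFunction { fun, args: resolve_all(args, registry)? })
        }
        Expr::ResolvedFunction { fun, args } => {
            Ok(Expr::ResolvedFunction { fun, args: resolve_all(args, registry)? })
        }
    }
}
```

In this sketch an unknown name is only reported when `resolve` runs, which is the trade-off of deferring lookup until a registry is available.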

> 2. Extract registration functions from SessionContext into their own trait / consolidate the function registry code rather than passing
>    around a set of HashMaps.... And make a way to actually modify them

It's a good idea to pack 3 HashMaps for scalar/aggr/window UDFs into a new struct like FunctionRegistry 👍🏼

pub mod utils;

/// Registers all "built in" functions from this crate with the provided registry
pub fn register_all(registry: &mut HashMap<String, Arc<ScalarUDF>>) {
@2010YOUY01 (Contributor) commented:

I think we should support registering a single function here; there might be a use case where the user wants to override only one function from a function package, possibly by changing this interface to something like:

pub fn register_all() {
    register_package(encoding::all_functions());
    register_function(my_encoding::decode()); // override a method in default function package
}
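
A runnable sketch of that suggestion, assuming a small map-backed `Registry` type. `Registry`, `encoding_all_functions`, the `origin` field, and the inline custom `decode` are hypothetical names used only to make the override visible.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Simplified stand-in for `ScalarUDF`; `origin` exists only for the demo.
pub struct ScalarUDF {
    pub name: String,
    pub origin: &'static str,
}

#[derive(Default)]
pub struct Registry {
    scalar: HashMap<String, Arc<ScalarUDF>>,
}

impl Registry {
    /// Registers one function, replacing any entry with the same name.
    pub fn register_function(&mut self, udf: ScalarUDF) {
        self.scalar.insert(udf.name.clone(), Arc::new(udf));
    }

    /// Registers every function in a package.
    pub fn register_package(&mut self, package: Vec<ScalarUDF>) {
        for udf in package {
            self.register_function(udf);
        }
    }

    pub fn get(&self, name: &str) -> Option<&ScalarUDF> {
        self.scalar.get(name).map(|udf| &**udf)
    }
}

/// Hypothetical default encoding package.
pub fn encoding_all_functions() -> Vec<ScalarUDF> {
    vec![
        ScalarUDF { name: "encode".to_string(), origin: "default" },
        ScalarUDF { name: "decode".to_string(), origin: "default" },
    ]
}

pub fn register_all(registry: &mut Registry) {
    registry.register_package(encoding_all_functions());
    // Registered last, so it overrides the package's `decode`.
    registry.register_function(ScalarUDF { name: "decode".to_string(), origin: "custom" });
}
```

Because later registrations replace earlier ones by name, the order of calls in `register_all` determines which implementation wins.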

Comment on lines 120 to 137
    /// Returns this function's name
    pub fn name(&self) -> &str {
        &self.name
    }
    /// Returns this function's signature
    pub fn signature(&self) -> &Signature {
        &self.signature
    }
    /// return the return type of this function given the types of the arguments
    pub fn return_type(&self, args: &[DataType]) -> Result<DataType> {
        // Old API returns an Arc of the datatype for some reason
        let res = (self.return_type)(args)?;
        Ok(res.as_ref().clone())
    }
    /// return the implementation of this function
    pub fn fun(&self) -> &ScalarFunctionImplementation {
        &self.fun
    }
@2010YOUY01 (Contributor) commented:

Is it the case that this set of interfaces is internal-facing, for execution, and that we might extend it while separating out function packages?
And that the FunctionImplementation trait is the user-facing API for defining functions in separate crates?

alamb commented Nov 6, 2023

> Thank you, this looks great. I have several questions/suggestions:
>
> 1. to support the expr_fns encode and decode, I think we will need an Expr::ScalarFunction call or something that can take a function by name rather than fully resolved function
>
> Constructing an Expr for a built-in function is currently stateless (it does not require a context), so it is tricky to stay backwards compatible with the Expr API. The best solution I can think of is to also support initializing a UDF Expr with only a name string and resolving it during logical plan optimization.

Yes, I agree this approach is the best I can come up with.

> 2. Extract registration functions from SessionContext into their own trait / consolidate the function registry code rather than passing
>    around a set of HashMaps.... And make a way to actually modify them
>
> It's a good idea to pack 3 HashMaps for scalar/aggr/window UDFs into a new struct like FunctionRegistry 👍🏼

👍 Unfortunately that name is already taken :) Maybe MemoryFunctionRegistry 🤔
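
A sketch of what packing the three maps into one struct might look like, using the name floated above. The three UDF types are simplified stand-ins, and the method set is an assumption modeled on the `udf` / `udaf` / `udwf` lookup names; none of this is taken from merged code.

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Simplified stand-ins for DataFusion's UDF types (assumption).
pub struct ScalarUDF { pub name: String }
pub struct AggregateUDF { pub name: String }
pub struct WindowUDF { pub name: String }

/// One struct owning all three maps, instead of passing them around separately.
#[derive(Default)]
pub struct MemoryFunctionRegistry {
    scalar: HashMap<String, Arc<ScalarUDF>>,
    aggregate: HashMap<String, Arc<AggregateUDF>>,
    window: HashMap<String, Arc<WindowUDF>>,
}

impl MemoryFunctionRegistry {
    pub fn register_udf(&mut self, f: ScalarUDF) {
        self.scalar.insert(f.name.clone(), Arc::new(f));
    }

    pub fn register_udaf(&mut self, f: AggregateUDF) {
        self.aggregate.insert(f.name.clone(), Arc::new(f));
    }

    pub fn register_udwf(&mut self, f: WindowUDF) {
        self.window.insert(f.name.clone(), Arc::new(f));
    }

    pub fn udf(&self, name: &str) -> Result<Arc<ScalarUDF>, String> {
        self.scalar
            .get(name)
            .cloned()
            .ok_or_else(|| format!("no scalar function named '{}'", name))
    }

    pub fn udaf(&self, name: &str) -> Result<Arc<AggregateUDF>, String> {
        self.aggregate
            .get(name)
            .cloned()
            .ok_or_else(|| format!("no aggregate function named '{}'", name))
    }

    pub fn udwf(&self, name: &str) -> Result<Arc<WindowUDF>, String> {
        self.window
            .get(name)
            .cloned()
            .ok_or_else(|| format!("no window function named '{}'", name))
    }
}
```

Each function kind keeps its own namespace, so a scalar `encode` does not collide with an aggregate or window function of the same name.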

@github-actions bot removed the optimizer (Optimizer rules) label Nov 18, 2023
@@ -34,6 +34,17 @@ pub trait FunctionRegistry {

/// Returns a reference to the udwf named `name`.
fn udwf(&self, name: &str) -> Result<Arc<WindowUDF>>;

/// Registers a new `ScalarUDF`, returning any previously registered
@alamb (Contributor Author) commented:

this is a new proposed API -- to allow registering new scalar UDFs with a FunctionRegistry.
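
The doc comment in the diff is cut off, so the exact signature is not visible here. The sketch below assumes a `register_udf` method that returns the previously registered function with the same name, if any; the error type, the `SimpleRegistry` implementor, and the `version` field are hypothetical.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Simplified stand-in for `ScalarUDF`; `version` exists only for the demo.
pub struct ScalarUDF {
    pub name: String,
    pub version: u32,
}

pub trait FunctionRegistry {
    /// Returns a reference to the udf named `name`.
    fn udf(&self, name: &str) -> Result<Arc<ScalarUDF>, String>;

    /// Registers a new `ScalarUDF`, returning any previously registered
    /// function with the same name (assumed semantics).
    fn register_udf(&mut self, udf: Arc<ScalarUDF>) -> Result<Option<Arc<ScalarUDF>>, String>;
}

/// Hypothetical map-backed implementor.
#[derive(Default)]
pub struct SimpleRegistry {
    scalar: HashMap<String, Arc<ScalarUDF>>,
}

impl FunctionRegistry for SimpleRegistry {
    fn udf(&self, name: &str) -> Result<Arc<ScalarUDF>, String> {
        self.scalar
            .get(name)
            .cloned()
            .ok_or_else(|| format!("no function named '{}'", name))
    }

    fn register_udf(&mut self, udf: Arc<ScalarUDF>) -> Result<Option<Arc<ScalarUDF>>, String> {
        // `HashMap::insert` already hands back the value it replaced.
        Ok(self.scalar.insert(udf.name.clone(), udf))
    }
}
```

Returning the replaced value lets a caller detect an accidental override or restore the earlier function later.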

@@ -1228,30 +1229,4 @@ mod test {
unreachable!();
}
}

#[test]
fn encode_function_definitions() {
@alamb alamb Nov 19, 2023

I don't think these tests add a lot -- they simply encode the signature again. This is also covered by actually calling encode() via the expr API / dataframe tests.

pub mod expr_fn {
use super::*;
/// Return encode(arg)
pub fn encode(args: Vec<Expr>) -> Expr {
@alamb (Contributor Author) commented:

Here are the new expr_fn implementations.

Labels: core (Core DataFusion crate), logical-expr (Logical plan and expressions), physical-expr (Changes to the physical-expr crates)
2 participants