Skip to content

ImproveSignature and comparison_coercion documentation #13840

New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Merged
merged 3 commits into from
Dec 20, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
33 changes: 21 additions & 12 deletions datafusion/expr-common/src/signature.rs
Original file line number Diff line number Diff line change
Expand Up @@ -103,9 +103,13 @@ pub enum TypeSignature {
/// A function such as `concat` is `Variadic(vec![DataType::Utf8,
/// DataType::LargeUtf8])`
Variadic(Vec<DataType>),
/// The acceptable signature and coercions rules to coerce arguments to this
/// signature are special for this function. If this signature is specified,
/// DataFusion will call `ScalarUDFImpl::coerce_types` to prepare argument types.
/// The acceptable signature and coercions rules are special for this
/// function.
///
/// If this signature is specified,
/// DataFusion will call [`ScalarUDFImpl::coerce_types`] to prepare argument types.
///
/// [`ScalarUDFImpl::coerce_types`]: https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.ScalarUDFImpl.html#method.coerce_types
UserDefined,
/// One or more arguments with arbitrary types
VariadicAny,
Expand All @@ -123,24 +127,29 @@ pub enum TypeSignature {
/// One or more arguments belonging to the [`TypeSignatureClass`], in order.
///
/// For example, `Coercible(vec![logical_float64()])` accepts
/// arguments like `vec![DataType::Int32]` or `vec![DataType::Float32]`
/// arguments like `vec![Int32]` or `vec![Float32]`
/// since i32 and f32 can be cast to f64
///
/// For functions that take no arguments (e.g. `random()`) see [`TypeSignature::Nullary`].
Coercible(Vec<TypeSignatureClass>),
/// One or more arguments that can be "compared"
/// One or more arguments coercible to a single, comparable type.
///
/// Each argument will be coerced to a single type using the
/// coercion rules described in [`comparison_coercion_numeric`].
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why numeric? Strings are also comparable

Copy link
Contributor

@jayzhan211 jayzhan211 Dec 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, but we have numeric currently, update doc when code is updated to be consistent

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree -- this code could be improved / named better. I looked at the implementation and there are quite a few differences between comparison_coercion_numeric and comparison_coercion that I didn't understand and thus didn't attempt to document

It would be great if someone could

///
/// # Examples
///
/// If the `nullif(1, 2)` function is called with `i32` and `i64` arguments
/// the types will both be coerced to `i64` before the function is invoked.
///
/// Each argument will be coerced to a single type based on comparison rules.
/// For example a function called with `i32` and `i64` has coerced type `Int64` so
/// each argument will be coerced to `Int64` before the function is invoked.
/// If the `nullif('1', 2)` function is called with `Utf8` and `i64` arguments
/// the types will both be coerced to `Utf8` before the function is invoked.
///
/// Note:
/// - If compares with numeric and string, numeric is preferred for numeric string cases. For example, `nullif('2', 1)` has coerced types `Int64`.
/// - If the result is Null, it will be coerced to String (Utf8View).
/// - See [`comparison_coercion`] for more details.
/// - For functions that take no arguments (e.g. `random()` see [`TypeSignature::Nullary`]).
/// - If all arguments have type [`DataType::Null`], they are coerced to `Utf8`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ouch

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am just documenting what I think the code does 🤷 -- I agree this area of the code could be improved

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To fix this, we might need to find a way to differentiate string literal as unknown type v.s. string type. Not yet have time to investigate on this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In some magic ideal world where I had infinite time, I have dreamed about refactor the coercion code into a set of structs that encoded the rules in a more structured / composable manner. But I don't have time to pursue that dream at this time 🏃

///
/// [`comparison_coercion`]: crate::type_coercion::binary::comparison_coercion
/// [`comparison_coercion_numeric`]: crate::type_coercion::binary::comparison_coercion_numeric
Comparable(usize),
/// One or more arguments of arbitrary types.
///
Expand Down
22 changes: 21 additions & 1 deletion datafusion/expr-common/src/type_coercion/binary.rs
Original file line number Diff line number Diff line change
Expand Up @@ -625,6 +625,19 @@ pub fn try_type_union_resolution_with_struct(
/// data type. However, users can write queries where the two arguments are
/// different data types. In such cases, the data types are automatically cast
/// (coerced) to a single data type to pass to the kernels.
///
/// # Numeric comparisons
///
/// When comparing numeric values, the lower precision type is coerced to the
/// higher precision type to avoid losing data. For example when comparing
/// `Int32` to `Int64` the coerced type is `Int64` so the `Int32` argument will
/// be cast.
///
/// # Numeric / String comparisons
///
/// When comparing numeric values and strings, both values will be coerced to
/// strings. For example when comparing `'2' > 1`, the arguments will be
/// coerced to `Utf8` for comparison
pub fn comparison_coercion(lhs_type: &DataType, rhs_type: &DataType) -> Option<DataType> {
if lhs_type == rhs_type {
// same type => equality is possible
Expand All @@ -642,7 +655,14 @@ pub fn comparison_coercion(lhs_type: &DataType, rhs_type: &DataType) -> Option<D
.or_else(|| struct_coercion(lhs_type, rhs_type))
}

/// Similar to [`comparison_coercion`] but prefer numeric if compares with numeric and string
/// Similar to [`comparison_coercion`] but prefers numeric if compares with
/// numeric and string
///
/// # Numeric comparisons
///
/// When comparing numeric values and strings, the values will be coerced to the
/// numeric type. For example, `'2' > 1` if `1` is an `Int32`, the arguments
/// will be coerced to `Int32`.
pub fn comparison_coercion_numeric(
lhs_type: &DataType,
rhs_type: &DataType,
Expand Down
Loading