Skip to content

Update the BTRIM scalar function to support Utf8View #11835

New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Closed
Tracked by #11790
tshauck opened this issue Aug 5, 2024 · 2 comments · Fixed by #11920
Closed
Tracked by #11790

Update the BTRIM scalar function to support Utf8View #11835

tshauck opened this issue Aug 5, 2024 · 2 comments · Fixed by #11920
Assignees
Labels
enhancement New feature or request

Comments

@tshauck
Copy link
Contributor

tshauck commented Aug 5, 2024

Part of #11752 and #11790

Currently, a call to BTRIM with a Utf8View datatypes induces a cast. After the change that fixes this issue, it should not.

## Ensure no casts for BTRIM
query TT
EXPLAIN SELECT
  BTRIM(column1_utf8view, 'foo') AS l
FROM test;
----
logical_plan
01)Projection: btrim(CAST(test.column1_utf8view AS Utf8), Utf8("foo")) AS l
02)--TableScan: test projection=[column1_utf8view]

btrim is defined here: https://github.com/apache/datafusion/blob/main/datafusion/functions/src/string/btrim.rs

Is your feature request related to a problem or challenge?

We are working to add complete StringView support in DataFusion, which permits potentially much faster processing of string data. See #10918 for more background.

Today, most DataFusion string functions support DataType::Utf8 and DataType::LargeUtf8 and when called with a StringView argument DataFusion will cast the argument back to DataType::Utf8 which is expensive.

To realize the full speed of StringView, we need to ensure that all string functions support the DataType::Utf8View directly.

Describe the solution you'd like

Update the function to support DataType::Utf8View directly

Describe alternatives you've considered

The typical steps are:

  1. Write some tests showing the function doesn't support Utf8View (see the tests in string_view.slt to ensure the arguments are not being cast
  2. Change the Signature of the function to accept Utf8View in addition to Utf8/LargeUtf8
  3. Update the implementation of the function to operate on Utf8View

Example PRs

Additional context

The documentation of string functions can be found here: https://datafusion.apache.org/user-guide/sql/scalar_functions.html#string-functions

To test a function with StringView with datafusion-cli you can use an example like this (replacing starts_with with the relevant function)

> create table foo as values (arrow_cast('foo', 'Utf8View'), arrow_cast('bar', 'Utf8View'));
0 row(s) fetched.
Elapsed 0.043 seconds.

> select starts_with(column1, column2) from foo;
+--------------------------------------+
| starts_with(foo.column1,foo.column2) |
+--------------------------------------+
| false                                |
+--------------------------------------+
1 row(s) fetched.
Elapsed 0.015 seconds.

To see if it is using utf8 view, use EXPLAIN to see the plan and verify there is no CAST. In this example the CAST(column1@0 AS Utf8) indicates that the function is not using Utf8View natively

> explain select starts_with(column1, column2) from foo;
+---------------+------------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                         |
+---------------+------------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Projection: starts_with(CAST(foo.column1 AS Utf8), CAST(foo.column2 AS Utf8))                                                |
|               |   TableScan: foo projection=[column1, column2]                                                                               |
| physical_plan | ProjectionExec: expr=[starts_with(CAST(column1@0 AS Utf8), CAST(column2@1 AS Utf8)) as starts_with(foo.column1,foo.column2)] |
|               |   MemoryExec: partitions=1, partition_sizes=[1]                                                                              |
|               |                                                                                                                              |
+---------------+------------------------------------------------------------------------------------------------------------------------------+
2 row(s) fetched.
Elapsed 0.006 seconds.

It is also often good to test with a constant as well (likewise there should be no cast):

> explain select starts_with(column1, 'foo') from foo;
+---------------+----------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                     |
+---------------+----------------------------------------------------------------------------------------------------------+
| logical_plan  | Projection: starts_with(CAST(foo.column1 AS Utf8), Utf8("foo"))                                          |
|               |   TableScan: foo projection=[column1]                                                                    |
| physical_plan | ProjectionExec: expr=[starts_with(CAST(column1@0 AS Utf8), foo) as starts_with(foo.column1,Utf8("foo"))] |
|               |   MemoryExec: partitions=1, partition_sizes=[1]                                                          |
|               |                                                                                                          |
+---------------+----------------------------------------------------------------------------------------------------------+
2 row(s) fetched.
Elapsed 0.002 seconds.

### Additional context

_No response_
@tshauck tshauck added the enhancement New feature or request label Aug 5, 2024
@tshauck tshauck changed the title Update the BTRIM scalar function to support Utf8View Update the BTRIM scalar function to support Utf8View Aug 5, 2024
@Kev1n8
Copy link
Contributor

Kev1n8 commented Aug 7, 2024

take

@Kev1n8
Copy link
Contributor

Kev1n8 commented Aug 8, 2024

I have a question about how this should be done. BTRIM would like to modify the input string, while Utf8View is not mutable. So is it correct that the input data_type can only be Utf8 or LargeUtf8? If so, should I leave the signature as

            signature: Signature::one_of(
                vec![
                    Exact(vec![Utf8]),
                    Exact(vec![Utf8, Utf8]),
                    Exact(vec![Utf8, Utf8View]),
                    // BTRIM wants to modify string, while StringView is not mutable, hence
                    // input should be cast to utf8
                    // Exact(vec![Utf8View]),
                    // Exact(vec![Utf8View, Utf8]),
                    // Exact(vec![Utf8View, Utf8View]),
                ],
                Volatility::Immutable,
            ),

or manually cast it in invoke(), which does not display CAST in plans:


            DataType::Utf8View => {
                let args = &[
                    args[0].cast_to(&DataType::Utf8, None)?,
                    args[1].clone(),
                ];
                make_scalar_function(
                    btrim::<i32>,
                    vec![Hint::Pad, Hint::AcceptsSingular],
                )(args)
            }

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants