Support casting BinaryView --> Utf8 and LargeUtf8 #6162

Closed
Tracked by #6163 ...
alamb opened this issue Jul 31, 2024 · 3 comments · Fixed by #6180
Labels
arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog

Comments

@alamb
Contributor

alamb commented Jul 31, 2024

Part of #6163

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
While working to enable wider use of StringView in DataFusion (apache/datafusion#11723), I found that this cast is not supported.

Specifically, create a BinaryViewArray and then call cast to convert it to Utf8:

cast(binary_view_array, &DataType::Utf8)
External error: query failed: DataFusion error: Error during planning: Cannot cast file schema field string_col of type BinaryView to table schema field of type Utf8

I think this comes up when a column is marked as "binary" in a Parquet file and DataFusion tries to read it in as a Utf8 column; the reader will be unhappy.

Describe the solution you'd like
Add support to the cast kernel for BinaryView -> Utf8 and LargeUtf8

@RinChanNOWWW added most of the support in #5704, and I think we can simply use the cast_view_to_byte function to build the correct StringArray

Describe alternatives you've considered

Additional context
FYI @XiangpengHao

@alamb alamb added the enhancement Any new improvement worthy of a entry in the changelog label Jul 31, 2024
@xinlifoobar
Contributor

take

@alamb
Contributor Author

alamb commented Aug 1, 2024

BTW I had a hacky version in datafusion apache/datafusion#11723: https://github.com/apache/datafusion/pull/11723/files#diff-07b427ee25e195566e30cca0e77e5eb4c63c54ea74f6ea15914fdd7a5a889186R169

In case that helps

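// Imports the two workaround functions below rely on
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, AsArray, StringBuilder};
use arrow::datatypes::DataType;
use arrow::error::ArrowError;

// Workaround arrow-rs bug in can_cast_types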
// External error: query failed: DataFusion error: Arrow error: Cast error: Casting from BinaryView to Utf8 not supported
fn can_cast_types(from_type: &DataType, to_type: &DataType) -> bool {
    arrow::compute::can_cast_types(from_type, to_type)
        || matches!(
            (from_type, to_type),
            (DataType::BinaryView, DataType::Utf8 | DataType::LargeUtf8)
                | (DataType::Utf8 | DataType::LargeUtf8, DataType::BinaryView)
        )
}

// Work around arrow-rs casting bug
// External error: query failed: DataFusion error: Arrow error: Cast error: Casting from BinaryView to Utf8 not supported
fn cast(array: &dyn Array, to_type: &DataType) -> Result<ArrayRef, ArrowError> {
    match (array.data_type(), to_type) {
        (DataType::BinaryView, DataType::Utf8) => {
            let array = array.as_binary_view();
            let mut builder = StringBuilder::with_capacity(array.len(), 8 * 1024);
            for value in array.iter() {
                // check if the value is valid utf8 (should do this once, not each value)
                let value = value.map(|value| std::str::from_utf8(value)).transpose()?;

                builder.append_option(value);
            }

            Ok(Arc::new(builder.finish()))
        }
        // fallback to arrow kernel
        (_, _) => arrow::compute::cast(array, to_type),
    }
}
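
The comment in the workaround above notes that UTF-8 validation should ideally happen once rather than once per value. As a rough, std-only sketch of that idea (this is not the arrow-rs implementation; `StringColumn` and `binary_to_string_column` are hypothetical names that only model a contiguous values buffer with offsets and a validity list), one could copy all values into a single buffer, validate that buffer in one pass, and then confirm that every row boundary falls on a character boundary:

```rust
/// Hypothetical stand-in for a Utf8 array: one contiguous buffer,
/// row offsets (rows + 1 entries), and a per-row validity flag.
struct StringColumn {
    values: String,
    offsets: Vec<usize>,
    validity: Vec<bool>,
}

impl StringColumn {
    /// Returns the string at row `i`, or `None` if the row is null.
    fn get(&self, i: usize) -> Option<&str> {
        if self.validity[i] {
            Some(&self.values[self.offsets[i]..self.offsets[i + 1]])
        } else {
            None
        }
    }
}

/// Builds a `StringColumn` from optional byte slices (modelling the
/// values of a binary array), validating UTF-8 once for the whole buffer.
fn binary_to_string_column<'a, I>(rows: I) -> Result<StringColumn, String>
where
    I: IntoIterator<Item = Option<&'a [u8]>>,
{
    let mut bytes = Vec::new();
    let mut offsets = vec![0usize];
    let mut validity = Vec::new();

    // Copy every value into one contiguous buffer; nulls add no bytes.
    for row in rows {
        if let Some(value) = row {
            bytes.extend_from_slice(value);
        }
        validity.push(row.is_some());
        offsets.push(bytes.len());
    }

    // Single validation pass over the concatenated bytes.
    let values = String::from_utf8(bytes).map_err(|e| format!("invalid UTF-8: {e}"))?;

    // A valid buffer is not enough: each row must start and end on a
    // character boundary, otherwise a row would hold a partial character.
    if let Some(bad) = offsets.iter().find(|&&o| !values.is_char_boundary(o)) {
        return Err(format!("offset {bad} splits a UTF-8 character"));
    }

    Ok(StringColumn { values, offsets, validity })
}
```

The boundary check matters because two rows holding the bytes 0xC3 and 0xA9 concatenate to a valid "é", yet neither row is valid UTF-8 on its own; the sketch rejects that input, which matches what per-value validation would do.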

@alamb
Contributor Author

alamb commented Aug 31, 2024

label_issue.py automatically added labels {'arrow'} from #6180
