
Allow to create record batch without validation #7071

Open
rluvaton opened this issue Feb 4, 2025 · 3 comments · May be fixed by #7072
Labels
enhancement Any new improvement worthy of an entry in the changelog

Comments

@rluvaton
Contributor

rluvaton commented Feb 4, 2025

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I want to sort a list based on a custom comparator function built from DataFusion expressions.

A comparison sort needs two items from the list at a time, so I created the following schema with two identical columns:

let schema = Arc::new(Schema::new(vec![
    Field::new("col_0", list.data_type().clone(), nullable),
    Field::new("col_1", list.data_type().clone(), nullable),
]));

and then I create a record batch with that schema on every comparator call inside sort_by:

let mut indices = (0..list.len()).collect::<Vec<_>>();
indices.sort_by(|&a, &b| {
    let function_input = RecordBatch::try_new(
        Arc::clone(&schema),
        vec![list.slice(a, 1), list.slice(b, 1)],
    );

    // evaluate on the record batch

    // return `std::cmp::Ordering` value
});

As you can see, try_new runs once per comparison, and each call repeats the same validation checks even though they evaluate to the same result every time.

Describe the solution you'd like
It would be helpful to have an unsafe constructor, new_unchecked, that skips these checks entirely.
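The requested shape can be sketched with a toy batch type rather than the real arrow-rs RecordBatch; the method name new_unchecked and its signature below mirror the proposal, not an existing API at the time of this issue:

```rust
struct Batch {
    num_columns: usize,
    columns: Vec<Vec<i64>>,
}

impl Batch {
    /// Checked constructor: validates that every column has the same length.
    fn try_new(columns: Vec<Vec<i64>>) -> Result<Self, String> {
        if let Some(first) = columns.first() {
            let len = first.len();
            if columns.iter().any(|c| c.len() != len) {
                return Err("column length mismatch".to_string());
            }
        }
        Ok(Self { num_columns: columns.len(), columns })
    }

    /// Proposed unchecked constructor: skips validation entirely.
    ///
    /// # Safety
    /// The caller must guarantee the invariants `try_new` would have
    /// checked (here: all columns share the same length).
    unsafe fn new_unchecked(columns: Vec<Vec<i64>>) -> Self {
        Self { num_columns: columns.len(), columns }
    }
}

fn main() {
    let checked = Batch::try_new(vec![vec![1, 2], vec![3, 4]]).unwrap();
    // SAFETY: both columns were just validated to have length 2.
    let unchecked = unsafe { Batch::new_unchecked(vec![vec![1, 2], vec![3, 4]]) };
    assert_eq!(checked.num_columns, unchecked.num_columns);
    assert!(Batch::try_new(vec![vec![1], vec![1, 2]]).is_err());
    println!("ok");
}
```

The point of the pattern is that the validated path pays the checks once, and callers who can prove the invariants themselves opt out explicitly via `unsafe`.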

Describe alternatives you've considered
Just calling try_new over and over again

@rluvaton rluvaton added the enhancement Any new improvement worthy of a entry in the changelog label Feb 4, 2025
@rluvaton rluvaton linked a pull request Feb 4, 2025 that will close this issue
@tustvold
Contributor

tustvold commented Feb 4, 2025

Evaluating the expressions in the comparator will be extremely inefficient:

  • It will potentially evaluate the expressions multiple times
  • Expressions are designed to be evaluated on arrays of multiple values

I'm not sure of your precise use case, but typically one would evaluate the expressions on the input batches and then process the outputs in one shot.

I struggle to think of an example where slicing in a hot loop is not a code smell.
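The "one shot" approach described above can be sketched without arrow at all: evaluate the expression once over the whole input to produce a key column, then sort indices by comparing precomputed keys. The `abs` key function here merely stands in for expression evaluation and is purely illustrative:

```rust
fn main() {
    let list = vec![30i64, -10, 20, -40]; // stand-in for the input array

    // "Evaluate the expression" once over the whole input, e.g. abs(x).
    let keys: Vec<i64> = list.iter().map(|x| x.abs()).collect();

    // Sort indices by the precomputed keys; no per-comparison evaluation,
    // no per-comparison RecordBatch construction.
    let mut indices: Vec<usize> = (0..list.len()).collect();
    indices.sort_by(|&a, &b| keys[a].cmp(&keys[b]));

    let sorted: Vec<i64> = indices.iter().map(|&i| list[i]).collect();
    assert_eq!(sorted, vec![-10, 20, 30, -40]);
    println!("{:?}", sorted);
}
```

This turns n·log(n) expression evaluations into a single vectorized pass, which is the shape columnar engines are designed around.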

@rluvaton
Contributor Author

rluvaton commented Feb 4, 2025

> Evaluating the expressions in the comparator will be extremely inefficient:
>
>   • It will potentially evaluate the expressions multiple times
>   • Expressions are designed to be evaluated on arrays of multiple values

I could materialise every transformation on each item beforehand (when possible).

> I'm not sure your precise use-case, but typically one would evaluate the expressions on the input batches and then process the outputs in one shot.

The use case is similar to array_sort with custom compare function in spark
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.array_sort.html

@tustvold
Contributor

tustvold commented Feb 4, 2025

https://docs.rs/arrow-ord/latest/arrow_ord/ord/type.DynComparator.html is how I would recommend providing a similar interface in a columnar execution environment (Spark is not a vectorized engine and instead relies on JIT compilation).

So users would instead provide a function with a signature similar to https://docs.rs/arrow-ord/latest/arrow_ord/ord/fn.make_comparator.html
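The `DynComparator` shape (a boxed `Fn(usize, usize) -> Ordering`, per the arrow-ord docs linked above) can be mimicked in plain Rust. The `make_comparator` below is a stand-in for `arrow_ord::ord::make_comparator`, built over plain slices instead of Arrow arrays:

```rust
use std::cmp::Ordering;

/// Mirrors arrow-ord's `DynComparator` type alias: a comparator over
/// row indices into two pre-bound columns.
type DynComparator = Box<dyn Fn(usize, usize) -> Ordering + Send + Sync>;

/// Stand-in for `arrow_ord::ord::make_comparator`: binds two (here
/// identical) columns once, up front, instead of slicing per comparison.
fn make_comparator(left: Vec<i64>, right: Vec<i64>) -> DynComparator {
    Box::new(move |i, j| left[i].cmp(&right[j]))
}

fn main() {
    let list = vec![3i64, 1, 2];
    let cmp = make_comparator(list.clone(), list.clone());

    // Sort indices using the prepared comparator; no RecordBatch
    // construction or validation inside the hot loop.
    let mut indices: Vec<usize> = (0..list.len()).collect();
    indices.sort_by(|&a, &b| cmp(a, b));

    assert_eq!(indices, vec![1, 2, 0]);
    println!("{:?}", indices);
}
```

The key design point is that all per-array setup (type dispatch, validation) happens once in `make_comparator`; each comparison is then just an indexed lookup.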
