
Allow to create record batch without validation #7071

Open
rluvaton opened this issue Feb 4, 2025 · 3 comments · May be fixed by #7072
Labels
enhancement Any new improvement worthy of an entry in the changelog

Comments

@rluvaton
Contributor

rluvaton commented Feb 4, 2025

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I want to sort a list based on a custom comparator function built from DataFusion expressions.

A comparison sort needs two items from the list at a time, so I created the following schema with two identical columns:

let schema = Arc::new(Schema::new(vec![
    Field::new("col_0", list.data_type().clone(), nullable),
    Field::new("col_1", list.data_type().clone(), nullable),
]));

and then I create a record batch with that schema on every comparator call inside sort_by:

let mut indices = (0..list.len()).collect::<Vec<_>>();
indices.sort_by(|&a, &b| {
    let function_input = RecordBatch::try_new(
        Arc::clone(&schema),
        vec![list.slice(a, 1), list.slice(b, 1)],
    );

    // evaluate on the record batch

    // return `std::cmp::Ordering` value
});

As you can see, try_new runs once per comparison, and each call repeats the same validation checks even though they evaluate to the same result every time.

Describe the solution you'd like
It would be helpful to have an unsafe constructor, new_unchecked, that skips these checks entirely.
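The requested shape can be sketched with a toy batch type rather than the real arrow-rs RecordBatch; the method name new_unchecked and its signature below mirror the proposal, not an existing API at the time of this issue:

```rust
struct Batch {
    num_columns: usize,
    columns: Vec<Vec<i64>>,
}

impl Batch {
    /// Checked constructor: validates that every column has the same length.
    fn try_new(columns: Vec<Vec<i64>>) -> Result<Self, String> {
        if let Some(first) = columns.first() {
            let len = first.len();
            if columns.iter().any(|c| c.len() != len) {
                return Err("column length mismatch".to_string());
            }
        }
        Ok(Self { num_columns: columns.len(), columns })
    }

    /// Proposed unchecked constructor: skips validation entirely.
    ///
    /// # Safety
    /// The caller must guarantee the invariants `try_new` would have
    /// checked (here: all columns share the same length).
    unsafe fn new_unchecked(columns: Vec<Vec<i64>>) -> Self {
        Self { num_columns: columns.len(), columns }
    }
}

fn main() {
    let checked = Batch::try_new(vec![vec![1, 2], vec![3, 4]]).unwrap();
    // SAFETY: both columns were just validated to have length 2.
    let unchecked = unsafe { Batch::new_unchecked(vec![vec![1, 2], vec![3, 4]]) };
    assert_eq!(checked.num_columns, unchecked.num_columns);
    assert!(Batch::try_new(vec![vec![1], vec![1, 2]]).is_err());
    println!("ok");
}
```

The point of the pattern is that the validated path pays the checks once, and callers who can prove the invariants themselves opt out explicitly via `unsafe`.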

Describe alternatives you've considered
Just calling try_new over and over again

@rluvaton rluvaton added the enhancement Any new improvement worthy of a entry in the changelog label Feb 4, 2025
@rluvaton rluvaton linked a pull request Feb 4, 2025 that will close this issue
@tustvold
Contributor

tustvold commented Feb 4, 2025

Evaluating the expressions in the comparator will be extremely inefficient:

  • It will potentially evaluate the expressions multiple times
  • Expressions are designed to be evaluated on arrays of multiple values

I'm not sure of your precise use case, but typically one would evaluate the expressions on the input batches and then process the outputs in one shot.

I struggle to think of an example where slicing in a hot loop is not a code smell.
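The "one shot" approach described above can be sketched without arrow at all: evaluate the expression once over the whole input to produce a key column, then sort indices by comparing precomputed keys. The `abs` key function here merely stands in for expression evaluation and is purely illustrative:

```rust
fn main() {
    let list = vec![30i64, -10, 20, -40]; // stand-in for the input array

    // "Evaluate the expression" once over the whole input, e.g. abs(x).
    let keys: Vec<i64> = list.iter().map(|x| x.abs()).collect();

    // Sort indices by the precomputed keys; no per-comparison evaluation,
    // no per-comparison RecordBatch construction.
    let mut indices: Vec<usize> = (0..list.len()).collect();
    indices.sort_by(|&a, &b| keys[a].cmp(&keys[b]));

    let sorted: Vec<i64> = indices.iter().map(|&i| list[i]).collect();
    assert_eq!(sorted, vec![-10, 20, 30, -40]);
    println!("{:?}", sorted);
}
```

This turns n·log(n) expression evaluations into a single vectorized pass, which is the shape columnar engines are designed around.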

@rluvaton
Contributor Author

rluvaton commented Feb 4, 2025

> Evaluating the expressions in the comparator will be extremely inefficient:
>
>   • It will potentially evaluate the expressions multiple times
>   • Expressions are designed to be evaluated on arrays of multiple values

I could materialise every transformation on each item beforehand (when possible).

> I'm not sure your precise use-case, but typically one would evaluate the expressions on the input batches and then process the outputs in one shot.

The use case is similar to array_sort with custom compare function in spark
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.array_sort.html

@tustvold
Contributor

tustvold commented Feb 4, 2025

https://docs.rs/arrow-ord/latest/arrow_ord/ord/type.DynComparator.html is how I would recommend providing a similar interface in a columnar execution environment (Spark is not a vectorized engine and instead relies on JIT compilation).

So users would instead provide a function with a signature similar to https://docs.rs/arrow-ord/latest/arrow_ord/ord/fn.make_comparator.html
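The `DynComparator` shape (a boxed `Fn(usize, usize) -> Ordering`, per the arrow-ord docs linked above) can be mimicked in plain Rust. The `make_comparator` below is a stand-in for `arrow_ord::ord::make_comparator`, built over plain slices instead of Arrow arrays:

```rust
use std::cmp::Ordering;

/// Mirrors arrow-ord's `DynComparator` type alias: a comparator over
/// row indices into two pre-bound columns.
type DynComparator = Box<dyn Fn(usize, usize) -> Ordering + Send + Sync>;

/// Stand-in for `arrow_ord::ord::make_comparator`: binds two (here
/// identical) columns once, up front, instead of slicing per comparison.
fn make_comparator(left: Vec<i64>, right: Vec<i64>) -> DynComparator {
    Box::new(move |i, j| left[i].cmp(&right[j]))
}

fn main() {
    let list = vec![3i64, 1, 2];
    let cmp = make_comparator(list.clone(), list.clone());

    // Sort indices using the prepared comparator; no RecordBatch
    // construction or validation inside the hot loop.
    let mut indices: Vec<usize> = (0..list.len()).collect();
    indices.sort_by(|&a, &b| cmp(a, b));

    assert_eq!(indices, vec![1, 2, 0]);
    println!("{:?}", indices);
}
```

The key design point is that all per-array setup (type dispatch, validation) happens once in `make_comparator`; each comparison is then just an indexed lookup.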
