Skip to content

Refactor array_union and array_intersect functions to one general function #8516

New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Merged
merged 13 commits into from
Dec 27, 2023

Conversation

Weijun-H
Copy link
Member

@Weijun-H Weijun-H commented Dec 12, 2023

Which issue does this PR close?

Closes #.

Rationale for this change

  • implement the output type for array_intersect
  • support LargeList
  • integrate array_union and array_intersect logic

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the physical-expr Changes to the physical-expr crates label Dec 12, 2023
@Weijun-H Weijun-H marked this pull request as draft December 12, 2023 16:41
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Dec 12, 2023
@Weijun-H Weijun-H marked this pull request as ready for review December 12, 2023 21:45
Comment on lines 3046 to 3054
query ?
select array_intersect([1, 1, 2, 2, 3, 3], null);
----
[1, 2, 3]

query ?
select array_intersect(null, [1, 1, 2, 2, 3, 3]);
----
[1, 2, 3]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found previous tests forgot to test that the array should be distinct when one of the arrays is NULL.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should return empty array or null in this case. DuckDB return empty array, Clickhouse return null. I

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previous implement is incorrect

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can return empty array. array_intersect(array, null) -> array (empty)
The idea is turn null to empty array.
array_union(array, null) -> array
array_except(array, null) -> array

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to return empty array for array_intersect, but met this panic.

[/Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/physical-expr/src/array_expressions.rs:1641] new_empty_array(array1.data_type()).data_type() = List(
    Field {
        name: "item",
        data_type: Int64,
        nullable: true,
        dict_id: 0,
        dict_is_ordered: false,
        metadata: {},
    },
)
thread 'main' panicked at 'index out of bounds: the len is 1 but the index is 1', /Users/huangweijun/.cargo/registry/src/index.crates.io-6f17d22bba15001f/arrow-array-49.0.0/src/array/list_array.rs:289:19
stack backtrace:
   0: rust_begin_unwind
             at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/panicking.rs:593:5
   1: core::panicking::panic_fmt
             at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/panicking.rs:67:14
   2: core::panicking::panic_bounds_check
             at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/panicking.rs:162:5
   3: arrow_array::array::list_array::GenericListArray<OffsetSize>::value
             at /Users/huangweijun/.cargo/registry/src/index.crates.io-6f17d22bba15001f/arrow-array-49.0.0/src/array/list_array.rs:289:19
   4: datafusion_common::scalar::ScalarValue::try_from_array
             at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/common/src/scalar.rs:2147:36
   5: datafusion_physical_expr::functions::make_scalar_function_with_hints::{{closure}}::{{closure}}
             at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/physical-expr/src/functions.rs:248:48
   6: core::result::Result<T,E>::and_then
             at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/result.rs:1319:22
   7: datafusion_physical_expr::functions::make_scalar_function_with_hints::{{closure}}
             at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/physical-expr/src/functions.rs:248:26
   8: datafusion_physical_expr::functions::create_physical_fun::{{closure}}
             at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/physical-expr/src/functions.rs:414:13
   9: <datafusion_physical_expr::scalar_function::ScalarFunctionExpr as datafusion_physical_expr::physical_expr::PhysicalExpr>::evaluate
             at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/physical-expr/src/scalar_function.rs:161:9
  10: datafusion_optimizer::simplify_expressions::expr_simplifier::ConstEvaluator::evaluate_to_scalar
             at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs:390:23
  11: <datafusion_optimizer::simplify_expressions::expr_simplifier::ConstEvaluator as datafusion_common::tree_node::TreeNodeRewriter>::mutate
             at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs:284:44
  12: datafusion_common::tree_node::TreeNode::rewrite
             at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/common/src/tree_node.rs:205:13
  13: datafusion_optimizer::simplify_expressions::expr_simplifier::ExprSimplifier<S>::simplify
             at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs:142:9
  14: datafusion_optimizer::simplify_expressions::simplify_exprs::SimplifyExpressions::optimize_internal::{{closure}}
             at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/optimizer/src/simplify_expressions/simplify_exprs.rs:102:29
  15: core::iter::adapters::map::map_try_fold::{{closure}}
             at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/iter/adapters/map.rs:91:28
  16: core::iter::traits::iterator::Iterator::try_fold
             at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/iter/traits/iterator.rs:2303:21
  17: <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::try_fold
             at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/iter/adapters/map.rs:117:9
  18: <core::iter::adapters::GenericShunt<I,R> as core::iter::traits::iterator::Iterator>::try_fold
             at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/iter/adapters/mod.rs:195:9
  19: <I as alloc::vec::in_place_collect::SpecInPlaceCollect<T,I>>::collect_in_place
             at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/alloc/src/vec/in_place_collect.rs:258:13
  20: alloc::vec::in_place_collect::<impl alloc::vec::spec_from_iter::SpecFromIter<T,I> for alloc::vec::Vec<T>>::from_iter
             at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/alloc/src/vec/in_place_collect.rs:182:28
  21: <alloc::vec::Vec<T> as core::iter::traits::collect::FromIterator<T>>::from_iter
             at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/alloc/src/vec/mod.rs:2696:9
  22: core::iter::traits::iterator::Iterator::collect
             at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/iter/traits/iterator.rs:1895:9
  23: <core::result::Result<V,E> as core::iter::traits::collect::FromIterator<core::result::Result<A,E>>>::from_iter::{{closure}}
             at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/result.rs:1932:51
  24: core::iter::adapters::try_process
             at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/iter/adapters/mod.rs:164:17
  25: <core::result::Result<V,E> as core::iter::traits::collect::FromIterator<core::result::Result<A,E>>>::from_iter
             at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/result.rs:1932:9
  26: core::iter::traits::iterator::Iterator::collect
             at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/iter/traits/iterator.rs:1895:9
  27: datafusion_optimizer::simplify_expressions::simplify_exprs::SimplifyExpressions::optimize_internal
             at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/optimizer/src/simplify_expressions/simplify_exprs.rs:96:20
  28: <datafusion_optimizer::simplify_expressions::simplify_exprs::SimplifyExpressions as datafusion_optimizer::optimizer::OptimizerRule>::try_optimize
             at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/optimizer/src/simplify_expressions/simplify_exprs.rs:57:17
  29: datafusion_optimizer::optimizer::Optimizer::optimize_recursively
             at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/optimizer/src/optimizer.rs:419:18
  30: datafusion_optimizer::optimizer::Optimizer::optimize
             at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/optimizer/src/optimizer.rs:294:21
  31: datafusion::execution::context::SessionState::optimize
             at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/core/src/execution/context/mod.rs:1791:13
  32: datafusion::execution::context::SessionState::create_physical_plan::{{closure}}
             at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/core/src/execution/context/mod.rs:1806:28
  33: datafusion::dataframe::DataFrame::create_physical_plan::{{closure}}
             at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/core/src/dataframe/mod.rs:154:61
  34: datafusion::dataframe::DataFrame::collect::{{closure}}
             at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/core/src/dataframe/mod.rs:758:48
  35: datafusion_cli::exec::exec_and_print::{{closure}}
             at ./src/exec.rs:231:36
  36: datafusion_cli::exec::exec_from_repl::{{closure}}
             at ./src/exec.rs:168:65
  37: datafusion_cli::main::{{closure}}
             at ./src/main.rs:221:14
  38: tokio::runtime::park::CachedParkThread::block_on::{{closure}}
             at /Users/huangweijun/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/park.rs:282:63
  39: tokio::runtime::coop::with_budget
             at /Users/huangweijun/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/coop.rs:107:5
  40: tokio::runtime::coop::budget
             at /Users/huangweijun/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/coop.rs:73:5
  41: tokio::runtime::park::CachedParkThread::block_on
             at /Users/huangweijun/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/park.rs:282:31
  42: tokio::runtime::context::blocking::BlockingRegionGuard::block_on
             at /Users/huangweijun/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/context/blocking.rs:66:9
  43: tokio::runtime::scheduler::multi_thread::MultiThread::block_on::{{closure}}
             at /Users/huangweijun/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/scheduler/multi_thread/mod.rs:87:13
  44: tokio::runtime::context::runtime::enter_runtime
             at /Users/huangweijun/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/context/runtime.rs:65:16
  45: tokio::runtime::scheduler::multi_thread::MultiThread::block_on
             at /Users/huangweijun/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/scheduler/multi_thread/mod.rs:86:9
  46: tokio::runtime::runtime::Runtime::block_on
             at /Users/huangweijun/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/runtime.rs:350:45
  47: datafusion_cli::main
             at ./src/main.rs:215:5
  48: core::ops::function::FnOnce::call_once
             at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

I am not sure it is an issue 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can try make_array(&[]) for empty array. Also, you need to fix the return_type in datafusion/expr/src/built_in_function.rs too.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should return empty array or null in this case. DuckDB return empty array, Clickhouse return null. I

I check DuckDB results, which are NULL also. @jayzhan211

D select typeof( list_intersect(null, null));
┌────────────────────────────────────┐
│ typeof(list_intersect(NULL, NULL)) │
│              varchar               │
├────────────────────────────────────┤
│ NULL                               │
└────────────────────────────────────┘
D select typeof( list_intersect(null, []));
┌─────────────────────────────────────────────────┐
│ typeof(list_intersect(NULL, main.list_value())) │
│                     varchar                     │
├─────────────────────────────────────────────────┤
│ NULL                                            │
└─────────────────────────────────────────────────┘

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Screenshot 2023-12-18 at 07 41 05

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should return empty array or null in this case. DuckDB return empty array, Clickhouse return null. I

I check DuckDB results, which are NULL also. @jayzhan211

D select typeof( list_intersect(null, null));
┌────────────────────────────────────┐
│ typeof(list_intersect(NULL, NULL)) │
│              varchar               │
├────────────────────────────────────┤
│ NULL                               │
└────────────────────────────────────┘
D select typeof( list_intersect(null, []));
┌─────────────────────────────────────────────────┐
│ typeof(list_intersect(NULL, main.list_value())) │
│                     varchar                     │
├─────────────────────────────────────────────────┤
│ NULL                                            │
└─────────────────────────────────────────────────┘

the first element should be list.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the first element is null, you should return null instead

@alamb
Copy link
Contributor

alamb commented Dec 15, 2023

@jayzhan211 do you have some time to review this PR?

query ?
select array_union(null, [1, 1, 2, 2, 3, 3]);
----
[1, 2, 3]
Copy link
Contributor

@jayzhan211 jayzhan211 Dec 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Union is correct

let converter = RowConverter::new(vec![SortField::new(dt.clone())])?;
for (first_arr, second_arr) in l.iter().zip(r.iter()) {
if let (Some(first_arr), Some(second_arr)) = (first_arr, second_arr) {
let l_values = converter.convert_columns(&[first_arr])?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems you iterate each row and convert_columns based on it. You call convert_columns N times, N is num of rows. But, previous implementation we just do one convert_columns and iterate the each row.

It seems this change is more costly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But the previous implementation is to convert_columns for all values at once, the current is to convert_columns for each element each time. The difference in cost is negligible. @jayzhan211

// Null type
(DataType::Null, DataType::List(field))
| (DataType::List(field), DataType::Null) => {
let array = match array1.data_type() {
Copy link
Contributor

@jayzhan211 jayzhan211 Dec 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is better just split null, list and list, null into two to avoid another type checking inside it.

@github-actions github-actions bot added the logical-expr Logical plan and expressions label Dec 18, 2023
}

let values = converter.convert_rows(rows)?;
let field = Arc::new(Field::new("item", dt, true));
Copy link
Contributor

@jayzhan211 jayzhan211 Dec 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why should we create a field here instead of passing the cloned one? if you got the error related to field, likely there is error in return_type or elsewhere

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch 👍

Copy link
Contributor

@jayzhan211 jayzhan211 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @Weijun-H and @jayzhan211

@alamb alamb merged commit 58b0a2b into apache:main Dec 27, 2023
@Weijun-H Weijun-H mentioned this pull request Dec 29, 2023
19 tasks
appletreeisyellow pushed a commit to appletreeisyellow/datafusion that referenced this pull request Jan 4, 2024
… function (apache#8516)

* Refactor array_union and array_intersect functions

* fix cli

* fix ci

* add tests for null

* modify the return type

* update tests

* fix clippy

* fix clippy

* add tests for largelist

* fix clippy

* Add field parameter to generic_set_lists() function

* Add large array drop statements

* fix clippy
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
logical-expr Logical plan and expressions physical-expr Changes to the physical-expr crates sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants