-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Refactor array_union
and array_intersect
functions to one general function
#8516
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
Conversation
f529f83
to
2aa3635
Compare
query ? | ||
select array_intersect([1, 1, 2, 2, 3, 3], null); | ||
---- | ||
[1, 2, 3] | ||
|
||
query ? | ||
select array_intersect(null, [1, 1, 2, 2, 3, 3]); | ||
---- | ||
[1, 2, 3] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found previous tests forgot to test that the array should be distinct when one of the arrays is NULL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should return empty array or null in this case. DuckDB return empty array, Clickhouse return null. I
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previous implement is incorrect
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can return empty array. array_intersect(array, null) -> array (empty)
The idea is turn null to empty array.
array_union(array, null) -> array
array_except(array, null) -> array
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to return empty array for array_intersect, but met this panic.
[/Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/physical-expr/src/array_expressions.rs:1641] new_empty_array(array1.data_type()).data_type() = List(
Field {
name: "item",
data_type: Int64,
nullable: true,
dict_id: 0,
dict_is_ordered: false,
metadata: {},
},
)
thread 'main' panicked at 'index out of bounds: the len is 1 but the index is 1', /Users/huangweijun/.cargo/registry/src/index.crates.io-6f17d22bba15001f/arrow-array-49.0.0/src/array/list_array.rs:289:19
stack backtrace:
0: rust_begin_unwind
at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/panicking.rs:593:5
1: core::panicking::panic_fmt
at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/panicking.rs:67:14
2: core::panicking::panic_bounds_check
at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/panicking.rs:162:5
3: arrow_array::array::list_array::GenericListArray<OffsetSize>::value
at /Users/huangweijun/.cargo/registry/src/index.crates.io-6f17d22bba15001f/arrow-array-49.0.0/src/array/list_array.rs:289:19
4: datafusion_common::scalar::ScalarValue::try_from_array
at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/common/src/scalar.rs:2147:36
5: datafusion_physical_expr::functions::make_scalar_function_with_hints::{{closure}}::{{closure}}
at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/physical-expr/src/functions.rs:248:48
6: core::result::Result<T,E>::and_then
at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/result.rs:1319:22
7: datafusion_physical_expr::functions::make_scalar_function_with_hints::{{closure}}
at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/physical-expr/src/functions.rs:248:26
8: datafusion_physical_expr::functions::create_physical_fun::{{closure}}
at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/physical-expr/src/functions.rs:414:13
9: <datafusion_physical_expr::scalar_function::ScalarFunctionExpr as datafusion_physical_expr::physical_expr::PhysicalExpr>::evaluate
at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/physical-expr/src/scalar_function.rs:161:9
10: datafusion_optimizer::simplify_expressions::expr_simplifier::ConstEvaluator::evaluate_to_scalar
at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs:390:23
11: <datafusion_optimizer::simplify_expressions::expr_simplifier::ConstEvaluator as datafusion_common::tree_node::TreeNodeRewriter>::mutate
at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs:284:44
12: datafusion_common::tree_node::TreeNode::rewrite
at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/common/src/tree_node.rs:205:13
13: datafusion_optimizer::simplify_expressions::expr_simplifier::ExprSimplifier<S>::simplify
at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs:142:9
14: datafusion_optimizer::simplify_expressions::simplify_exprs::SimplifyExpressions::optimize_internal::{{closure}}
at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/optimizer/src/simplify_expressions/simplify_exprs.rs:102:29
15: core::iter::adapters::map::map_try_fold::{{closure}}
at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/iter/adapters/map.rs:91:28
16: core::iter::traits::iterator::Iterator::try_fold
at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/iter/traits/iterator.rs:2303:21
17: <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::try_fold
at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/iter/adapters/map.rs:117:9
18: <core::iter::adapters::GenericShunt<I,R> as core::iter::traits::iterator::Iterator>::try_fold
at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/iter/adapters/mod.rs:195:9
19: <I as alloc::vec::in_place_collect::SpecInPlaceCollect<T,I>>::collect_in_place
at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/alloc/src/vec/in_place_collect.rs:258:13
20: alloc::vec::in_place_collect::<impl alloc::vec::spec_from_iter::SpecFromIter<T,I> for alloc::vec::Vec<T>>::from_iter
at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/alloc/src/vec/in_place_collect.rs:182:28
21: <alloc::vec::Vec<T> as core::iter::traits::collect::FromIterator<T>>::from_iter
at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/alloc/src/vec/mod.rs:2696:9
22: core::iter::traits::iterator::Iterator::collect
at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/iter/traits/iterator.rs:1895:9
23: <core::result::Result<V,E> as core::iter::traits::collect::FromIterator<core::result::Result<A,E>>>::from_iter::{{closure}}
at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/result.rs:1932:51
24: core::iter::adapters::try_process
at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/iter/adapters/mod.rs:164:17
25: <core::result::Result<V,E> as core::iter::traits::collect::FromIterator<core::result::Result<A,E>>>::from_iter
at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/result.rs:1932:9
26: core::iter::traits::iterator::Iterator::collect
at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/iter/traits/iterator.rs:1895:9
27: datafusion_optimizer::simplify_expressions::simplify_exprs::SimplifyExpressions::optimize_internal
at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/optimizer/src/simplify_expressions/simplify_exprs.rs:96:20
28: <datafusion_optimizer::simplify_expressions::simplify_exprs::SimplifyExpressions as datafusion_optimizer::optimizer::OptimizerRule>::try_optimize
at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/optimizer/src/simplify_expressions/simplify_exprs.rs:57:17
29: datafusion_optimizer::optimizer::Optimizer::optimize_recursively
at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/optimizer/src/optimizer.rs:419:18
30: datafusion_optimizer::optimizer::Optimizer::optimize
at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/optimizer/src/optimizer.rs:294:21
31: datafusion::execution::context::SessionState::optimize
at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/core/src/execution/context/mod.rs:1791:13
32: datafusion::execution::context::SessionState::create_physical_plan::{{closure}}
at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/core/src/execution/context/mod.rs:1806:28
33: datafusion::dataframe::DataFrame::create_physical_plan::{{closure}}
at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/core/src/dataframe/mod.rs:154:61
34: datafusion::dataframe::DataFrame::collect::{{closure}}
at /Users/huangweijun/Desktop/rust/arrow-datafusion/datafusion/core/src/dataframe/mod.rs:758:48
35: datafusion_cli::exec::exec_and_print::{{closure}}
at ./src/exec.rs:231:36
36: datafusion_cli::exec::exec_from_repl::{{closure}}
at ./src/exec.rs:168:65
37: datafusion_cli::main::{{closure}}
at ./src/main.rs:221:14
38: tokio::runtime::park::CachedParkThread::block_on::{{closure}}
at /Users/huangweijun/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/park.rs:282:63
39: tokio::runtime::coop::with_budget
at /Users/huangweijun/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/coop.rs:107:5
40: tokio::runtime::coop::budget
at /Users/huangweijun/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/coop.rs:73:5
41: tokio::runtime::park::CachedParkThread::block_on
at /Users/huangweijun/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/park.rs:282:31
42: tokio::runtime::context::blocking::BlockingRegionGuard::block_on
at /Users/huangweijun/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/context/blocking.rs:66:9
43: tokio::runtime::scheduler::multi_thread::MultiThread::block_on::{{closure}}
at /Users/huangweijun/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/scheduler/multi_thread/mod.rs:87:13
44: tokio::runtime::context::runtime::enter_runtime
at /Users/huangweijun/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/context/runtime.rs:65:16
45: tokio::runtime::scheduler::multi_thread::MultiThread::block_on
at /Users/huangweijun/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/scheduler/multi_thread/mod.rs:86:9
46: tokio::runtime::runtime::Runtime::block_on
at /Users/huangweijun/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/runtime.rs:350:45
47: datafusion_cli::main
at ./src/main.rs:215:5
48: core::ops::function::FnOnce::call_once
at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
I am not sure it is an issue 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can try make_array(&[])
for empty array. Also, you need to fix the return_type
in datafusion/expr/src/built_in_function.rs
too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should return empty array or null in this case. DuckDB return empty array, Clickhouse return null. I
I check DuckDB results, which are NULL also. @jayzhan211
D select typeof( list_intersect(null, null));
┌────────────────────────────────────┐
│ typeof(list_intersect(NULL, NULL)) │
│ varchar │
├────────────────────────────────────┤
│ NULL │
└────────────────────────────────────┘
D select typeof( list_intersect(null, []));
┌─────────────────────────────────────────────────┐
│ typeof(list_intersect(NULL, main.list_value())) │
│ varchar │
├─────────────────────────────────────────────────┤
│ NULL │
└─────────────────────────────────────────────────┘
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should return empty array or null in this case. DuckDB return empty array, Clickhouse return null. I
I check DuckDB results, which are NULL also. @jayzhan211
D select typeof( list_intersect(null, null)); ┌────────────────────────────────────┐ │ typeof(list_intersect(NULL, NULL)) │ │ varchar │ ├────────────────────────────────────┤ │ NULL │ └────────────────────────────────────┘ D select typeof( list_intersect(null, [])); ┌─────────────────────────────────────────────────┐ │ typeof(list_intersect(NULL, main.list_value())) │ │ varchar │ ├─────────────────────────────────────────────────┤ │ NULL │ └─────────────────────────────────────────────────┘
the first element should be list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the first element is null, you should return null instead
@jayzhan211 do you have some time to review this PR? |
query ? | ||
select array_union(null, [1, 1, 2, 2, 3, 3]); | ||
---- | ||
[1, 2, 3] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Union is correct
let converter = RowConverter::new(vec![SortField::new(dt.clone())])?; | ||
for (first_arr, second_arr) in l.iter().zip(r.iter()) { | ||
if let (Some(first_arr), Some(second_arr)) = (first_arr, second_arr) { | ||
let l_values = converter.convert_columns(&[first_arr])?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems you iterate each row and convert_columns based on it. You call convert_columns
N times, N is num of rows. But, previous implementation we just do one convert_columns
and iterate the each row.
It seems this change is more costly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But the previous implementation is to convert_columns
for all values at once, the current is to convert_columns
for each element each time. The difference in cost is negligible. @jayzhan211
// Null type | ||
(DataType::Null, DataType::List(field)) | ||
| (DataType::List(field), DataType::Null) => { | ||
let array = match array1.data_type() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is better just split null, list and list, null into two to avoid another type checking inside it.
6fee213
to
fe1083b
Compare
} | ||
|
||
let values = converter.convert_rows(rows)?; | ||
let field = Arc::new(Field::new("item", dt, true)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why should we create a field here instead of passing the cloned one? if you got the error related to field, likely there is error in return_type or elsewhere
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
77d9480
to
c34e6c0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @Weijun-H and @jayzhan211
… function (apache#8516) * Refactor array_union and array_intersect functions * fix cli * fix ci * add tests for null * modify the return type * update tests * fix clippy * fix clippy * add tests for largelist * fix clippy * Add field parameter to generic_set_lists() function * Add large array drop statements * fix clippy
Which issue does this PR close?
Closes #.
Rationale for this change
array_intersect
LargeList
array_union
andarray_intersect
logicWhat changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?