feat: Enforce the uniqueness of map key name for the map/make_map function #12153
Conversation
Thanks @Weijun-H, overall LGTM. I also tested some cases with nested types as the key. They work well. 👍
```
# Test duplicate keys
query error DataFusion error: Execution error: map key must be unique, duplicate key found: \[1, 2\]
SELECT MAP([[1,2], [1,2], []], [41, 33, null]);

# Test duplicate keys
query error DataFusion error: Execution error: map key must be unique, duplicate key found: \[\{1:1\}\]
SELECT MAP([Map {1:'1'}, Map {1:'1'}, Map {2:'2'}], [41, 33, null]);
```
Nice catch! I added them to the tests.
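The duplicate-key rejection these tests exercise can be sketched in plain Rust. This is a minimal, hypothetical stand-in for DataFusion's per-row check, comparing formatted key values via a `HashSet`, not the actual implementation:

```rust
use std::collections::HashSet;

// Hypothetical sketch (not DataFusion's code): format each key and track
// what has been seen; error on the first repeat.
fn check_unique_keys<T: std::fmt::Debug>(keys: &[T]) -> Result<(), String> {
    let mut seen = HashSet::new();
    for key in keys {
        let repr = format!("{key:?}");
        if !seen.insert(repr.clone()) {
            return Err(format!(
                "map key must be unique, duplicate key found: {repr}"
            ));
        }
    }
    Ok(())
}

fn main() {
    // Unique keys pass.
    assert!(check_unique_keys(&[[1, 2], [3, 4]]).is_ok());
    // A repeated key, as in the tests above, is rejected.
    let err = check_unique_keys(&[[1, 2], [1, 2]]).unwrap_err();
    assert!(err.contains("duplicate key found"));
}
```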
I plan to review this today.
```rust
for i in 0..key_array.len() {
    let row_keys = key_array.slice(i, 1);
    let row_keys = first_array_for_list(row_keys.as_ref());
```
Can we use `datafusion_common::utils::list_to_arrays::<O>(keys)` here? Also, is there any specific reason for making this an inner fn? Why not move it out, for better readability?
Looks good, but I suggested some minor changes for better code readability.
@jayzhan211 do you have some time to review this PR?
```rust
    make_map_batch_internal(keys, values, can_evaluate_to_const, args[0].data_type())
}
```

```rust
fn check_unique_keys(array: &dyn Array) -> Result<()> {
```
Will it be faster if we check early?
```rust
fn plan_make_map(&self, args: Vec<Expr>) -> Result<PlannerResult<Vec<Expr>>> {
    if args.len() % 2 != 0 {
        return plan_err!("make_map requires an even number of arguments");
    }

    let (keys, values): (Vec<_>, Vec<_>) =
        args.into_iter().enumerate().partition(|(i, _)| i % 2 == 0);
    // check here?
    let keys = make_array(keys.into_iter().map(|(_, e)| e).collect());
    let values = make_array(values.into_iter().map(|(_, e)| e).collect());

    Ok(PlannerResult::Planned(Expr::ScalarFunction(
        ScalarFunction::new_udf(map_udf(), vec![keys, values]),
    )))
}
```
```rust
pub fn map(keys: Vec<Expr>, values: Vec<Expr>) -> Expr {
    // check here
    let keys = make_array(keys);
    let values = make_array(values);
    Expr::ScalarFunction(ScalarFunction::new_udf(map_udf(), vec![keys, values]))
}
```
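The even/odd argument splitting used in `plan_make_map` can be seen in isolation. A stdlib-only sketch, with `&str` standing in for `Expr`:

```rust
// Alternating arguments [k1, v1, k2, v2, ...] are partitioned by index
// parity, then the indices are stripped off again -- the same idiom as in
// plan_make_map above.
fn split_alternating<T>(args: Vec<T>) -> (Vec<T>, Vec<T>) {
    let (keys, values): (Vec<_>, Vec<_>) =
        args.into_iter().enumerate().partition(|(i, _)| i % 2 == 0);
    (
        keys.into_iter().map(|(_, e)| e).collect(),
        values.into_iter().map(|(_, e)| e).collect(),
    )
}

fn main() {
    let (keys, values) = split_alternating(vec!["a", "1", "b", "2"]);
    assert_eq!(keys, ["a", "b"]);
    assert_eq!(values, ["1", "2"]);
}
```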
We could have a single function for these two:
```rust
fn create_map_from_keys_and_values(keys: Vec<Expr>, values: Vec<Expr>) -> Expr {
    // check keys
    let keys = make_array(keys);
    let values = make_array(values);
    map_udf().call(vec![keys, values])
}
```
Another benefit is that we can have a `map_with_checks` version that avoids the unique key check.
We could check for unique keys earlier in the `make_map` function, but since the `MAP` function is a UDF, the unique key validation needs to be handled in the `make_map_batch` function. This approach, however, will result in the `make_map` function performing the check twice.
The uniqueness is guaranteed in the semantic layer, not in the `map` internal function, so we don't need to check again in the internal function. Consider that if someone wants a map function that allows duplicate keys, the check you added blocks their requirement.
Yes; however, it cannot guarantee this query:

```
External error: The query was expected to fail but actually succeeded:
[SQL] SELECT MAP(['POST', 'HEAD', 'POST'], [41, 33, NULL]);
```

According to DuckDB behavior, this should fail because the keys in the map are not unique.
I found my approach doesn't work for columns; we can only check them in `invoke`.
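A small sketch of why the planning-time approach falls short for columns (hypothetical, stdlib-only; `&str` expression text stands in for `Expr`):

```rust
use std::collections::HashSet;

// A planning-time check can only compare the key *expressions*.
fn exprs_are_unique(exprs: &[&str]) -> bool {
    let mut seen = HashSet::new();
    exprs.iter().all(|e| seen.insert(*e))
}

fn main() {
    // Two syntactically distinct key expressions pass the planning check...
    assert!(exprs_are_unique(&["col_a", "col_b"]));
    // ...but at execution time both columns may hold the same value for a
    // row, producing duplicate keys -- which is why the check has to run in
    // invoke, over the evaluated arrays.
    let evaluated = ["POST", "POST"];
    assert!(!exprs_are_unique(&evaluated));
}
```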
👍
Thank you @dharanad @goldmedal @jayzhan211
Which issue does this PR close?
Closes #11437
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?