feat: Enforce the uniqueness of map key name for the map/make_map function #12153

Merged: 9 commits merged into apache:main from 11437-unique-key-in-map on Sep 4, 2024

Conversation

@Weijun-H (Member) commented Aug 25, 2024

Which issue does this PR close?

Closes #11437

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added sqllogictest SQL Logic Tests (.slt) common Related to common crate labels Aug 25, 2024
@Weijun-H Weijun-H changed the title feat: Enforce the uniqueness of map key name for the map/make_map fun… feat: Enforce the uniqueness of map key name for the map/make_map function Aug 25, 2024
@goldmedal (Contributor) left a comment

Thanks @Weijun-H, overall LGTM. I also tested some cases with nested types as the key. They work well. 👍

# Test duplicate keys
query error DataFusion error: Execution error: map key must be unique, duplicate key found: \[1, 2\]
SELECT MAP([[1,2], [1,2], []], [41, 33, null]);

# Test duplicate keys
query error DataFusion error: Execution error: map key must be unique, duplicate key found: \[\{1:1\}\]
SELECT MAP([Map {1:'1'}, Map {1:'1'}, Map {2:'2'}], [41, 33, null]);

@Weijun-H (Member, Author), replying to @goldmedal's test cases above:

Nice catch! I added them to the tests.

@dharanad (Contributor):

I plan to review this today

Comment on lines 72 to 74
for i in 0..key_array.len() {
let row_keys = key_array.slice(i, 1);
let row_keys = first_array_for_list(row_keys.as_ref());
Contributor:

Can we use datafusion_common::utils::list_to_arrays::<O>(keys) here?
Also, is there any specific reason for making this an inner fn? Why not move it out, for better readability?
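
A rough sketch of that suggested rewrite (illustrative only: it assumes list_to_arrays yields one child array per row of the list array and that the surrounding names match the diff context; the exact signature should be verified in datafusion_common::utils):

    // Sketch: replace the manual slice loop with the suggested helper.
    // Assumption: list_to_arrays::<O>(...) takes the list array and returns
    // one ArrayRef per row.
    use datafusion_common::utils::list_to_arrays;

    for row_keys in list_to_arrays::<i32>(key_array.clone()) {
        // check_unique_keys is the per-row validation added by this PR
        check_unique_keys(row_keys.as_ref())?;
    }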

@dharanad (Contributor) left a comment

Looks good, but I suggested some minor changes for better code readability.

@Weijun-H (Member, Author) commented Sep 1, 2024

@jayzhan211 do you have some time to review this PR?

make_map_batch_internal(keys, values, can_evaluate_to_const, args[0].data_type())
}

fn check_unique_keys(array: &dyn Array) -> Result<()> {
Contributor:

Will it be faster if we check early?

    fn plan_make_map(&self, args: Vec<Expr>) -> Result<PlannerResult<Vec<Expr>>> {
        if args.len() % 2 != 0 {
            return plan_err!("make_map requires an even number of arguments");
        }

        let (keys, values): (Vec<_>, Vec<_>) =
            args.into_iter().enumerate().partition(|(i, _)| i % 2 == 0);
       // check here?
        let keys = make_array(keys.into_iter().map(|(_, e)| e).collect());
        let values = make_array(values.into_iter().map(|(_, e)| e).collect());

        Ok(PlannerResult::Planned(Expr::ScalarFunction(
            ScalarFunction::new_udf(map_udf(), vec![keys, values]),
        )))
    }
pub fn map(keys: Vec<Expr>, values: Vec<Expr>) -> Expr {
    // check here
    let keys = make_array(keys);
    let values = make_array(values);
    Expr::ScalarFunction(ScalarFunction::new_udf(map_udf(), vec![keys, values]))
}
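
A minimal sketch of what such an early, planner-level check might look like, assuming only literal keys can be validated at planning time (the function name and error message are illustrative, not the merged code):

    use datafusion_common::{plan_err, Result};
    use datafusion_expr::Expr;

    /// Hypothetical early check: reject duplicate *literal* keys while planning.
    /// Keys that are columns or other non-literal expressions can only be
    /// validated later, when the function is invoked on concrete data.
    fn check_unique_literal_keys(keys: &[Expr]) -> Result<()> {
        let literals: Vec<&Expr> = keys
            .iter()
            .filter(|e| matches!(e, Expr::Literal(_)))
            .collect();
        for (i, key) in literals.iter().enumerate() {
            if literals[..i].contains(key) {
                return plan_err!("map key must be unique, duplicate key found: {key}");
            }
        }
        Ok(())
    }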

@jayzhan211 (Contributor) commented Sep 1, 2024:

We could have a function for these two:

        fn create_map_from_keys_and_values(keys: Vec<Expr>, values: Vec<Expr>) -> Expr {
            // check keys
            let keys = make_array(keys);
            let values = make_array(values);
            map_udf().call(vec![keys, values])
        }

Contributor:

Another benefit is that we can have a map_with_checks version that avoids the unique key check.

@Weijun-H (Member, Author):

We could check for unique keys earlier, in the make_map function, but since the MAP function is a UDF, the unique key validation still needs to be handled in make_map_batch. This approach, however, would result in make_map performing the check twice.

@jayzhan211 (Contributor) commented Sep 4, 2024:

The uniqueness is guaranteed in the semantic layer, not in the map internal function, so we don't need to check again in the internal function. Consider that if someone wants a map function that allows duplicate keys, this check blocks their requirement.

@Weijun-H (Member, Author):

Yes; however, it cannot guarantee this for the following query:

External error: The query was expected to fail but actually succeeded:
[SQL] SELECT MAP(['POST', 'HEAD', 'POST'], [41, 33, NULL]);

According to DuckDB behavior, this should have failed because the keys in the map are not unique.

Contributor:

I found that my approach doesn't work for columns; we can only check them in invoke.
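
For reference, an invoke-time check over the key array could look roughly like this (a sketch that compares keys as ScalarValues; the merged check_unique_keys may be implemented differently):

    use arrow::array::Array;
    use datafusion_common::{exec_err, Result, ScalarValue};

    /// Sketch of an execution-time uniqueness check: it sees the concrete key
    /// values, so it also covers keys that come from columns.
    fn check_unique_keys_sketch(array: &dyn Array) -> Result<()> {
        let mut seen: Vec<ScalarValue> = Vec::with_capacity(array.len());
        for i in 0..array.len() {
            let key = ScalarValue::try_from_array(array, i)?;
            if seen.contains(&key) {
                return exec_err!("map key must be unique, duplicate key found: {key}");
            }
            seen.push(key);
        }
        Ok(())
    }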

@Weijun-H Weijun-H force-pushed the 11437-unique-key-in-map branch from 4cafcec to 5acf371 Compare September 3, 2024 14:51
@Weijun-H Weijun-H force-pushed the 11437-unique-key-in-map branch from 2508a8f to c84c9ee Compare September 4, 2024 08:53
@Weijun-H Weijun-H requested a review from jayzhan211 September 4, 2024 08:54
@jayzhan211 (Contributor) left a comment

👍

@Weijun-H Weijun-H merged commit 9ab2724 into apache:main Sep 4, 2024
24 checks passed
@Weijun-H Weijun-H deleted the 11437-unique-key-in-map branch September 4, 2024 09:54
@Weijun-H (Member, Author) commented Sep 4, 2024

Thank you @dharanad @goldmedal @jayzhan211

Labels: common (Related to common crate), sqllogictest (SQL Logic Tests (.slt))
Successfully merging this pull request may close these issues: Enforce the uniqueness of map key name for the map/make_map function
4 participants