Easier Dataframe API for map
#11546
Comments
take |
@jayzhan211 I have drafted a PR #11560 for it. The design differs from your proposal, but it makes sense to me. Could you take a look at it? The core concept is providing an expression function and wrapping Maybe we can rename |
My concern is that it might be slower because of additional |
I see. I think I can do some benchmarks for them. Because I have some concerns mentioned in #11452 (comment) for changing the |
I think in this case we should adjust the coercion rule with |
Ideally MapFunc should have the arguments that have minimum transformation and computation cost for creating MapArray, so we can get the most efficient implementation.
|
I followed #11526 to create another implementation for
I ran the benchmark many times, and each time I got similar results. Referring to the result, I think we can just use the original design here. What do you think? By the way, I found that the compile time to run the benchmark in the core is very long. It takes about 9 minutes. I'm not sure if that's normal. 😢
|
    pub fn map(keys: Vec<Expr>, values: Vec<Expr>) -> Expr {
        let keys = make_array(keys);
        let values = make_array(values);
        Expr::ScalarFunction(ScalarFunction::new_udf(
            map_udf(),
            vec![keys, values],
        ))
    }

    pub fn map_from_array(keys: Vec<Expr>, values: Vec<Expr>) -> Expr {
        let keys = make_array(keys);
        let values = make_array(values);
        Expr::ScalarFunction(ScalarFunction::new_udf(
            map_one_udf(),
            vec![keys, values],
        ))
    }

It seems they both compute

    pub fn map_from_array(keys: Vec<Expr>, values: Vec<Expr>) -> Expr {
        let mut args = keys;
        args.extend(values);
        Expr::ScalarFunction(ScalarFunction::new_udf(
            map_one_udf(),
            args,
        ))
    }
|
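To make the difference between the two argument shapes concrete, here is a minimal, self-contained sketch. It does not use DataFusion: the `Expr` enum, `make_array`, `map_wrapped`, `map_flat`, and `arg_count` below are hypothetical stand-ins invented for illustration, so it only shows how the argument lists differ, not how the real UDFs behave or perform.

```rust
// Toy stand-in for DataFusion's `Expr`; this is NOT the real type.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Literal(String),
    // Hypothetical call node: a function name plus its arguments.
    Call { name: String, args: Vec<Expr> },
}

// Hypothetical helper playing the role that `make_array` plays in the thread.
fn make_array(items: Vec<Expr>) -> Expr {
    Expr::Call { name: "make_array".to_string(), args: items }
}

// Shape 1: wrap keys and values in arrays, so the call takes two arguments.
fn map_wrapped(keys: Vec<Expr>, values: Vec<Expr>) -> Expr {
    Expr::Call {
        name: "map".to_string(),
        args: vec![make_array(keys), make_array(values)],
    }
}

// Shape 2: pass one flat argument list, keys first and values after.
fn map_flat(keys: Vec<Expr>, values: Vec<Expr>) -> Expr {
    let mut args = keys;
    args.extend(values);
    Expr::Call { name: "map".to_string(), args }
}

// Number of direct arguments of a call node (0 for a literal).
fn arg_count(expr: &Expr) -> usize {
    match expr {
        Expr::Call { args, .. } => args.len(),
        Expr::Literal(_) => 0,
    }
}
```

With two keys and two values, the wrapped shape yields a call with two arguments (each an array expression), while the flat shape yields a call with four arguments that the function must later split itself.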
I'm not sure whether it is expected, I guess because |
It is not the case for running |
Oops... Sorry about that. I forgot to remove this. I will provide another benchmark result. Many thanks. |
Here is the benchmark result after removing
I think the result is really bad, but I tried to understand why. See datafusion/datafusion/functions-array/src/make_array.rs, lines 102 to 104 in 5da7ab3.
I will try this approach to modify the two versions and provide another benchmark. |
Ok, I think it's getting worse.
I also tried to remove
Just pass an args vector to Actually, I found that In conclusion, the original design (using |
In theory, I didn't expect this but I don't understand why. We can move on with |
functions-nested? For array, struct, map |
It looks good, but I think we can have another PR for it. It's also related to changing the name of |
Dataframe API for map expects us to pass args with make_array, i.e.
I think we could have an easier one without make_array.
To achieve this we may need to change the arguments of MapFunc from two arrays to Vec<Expr>, where the first half are keys and the other half are values.
Originally posted by @jayzhan211 in #11452 (comment)
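The "first half are keys, other half are values" convention could be handled roughly as follows. This is only a sketch under stated assumptions: `split_keys_values` is a hypothetical helper, not part of DataFusion, and it is generic over `T` so it runs without the real `Expr` type.

```rust
/// Split a flat argument list into (keys, values), assuming the first half
/// holds the keys and the second half holds the values.
/// Hypothetical helper for illustration; not a DataFusion API.
fn split_keys_values<T>(mut args: Vec<T>) -> Result<(Vec<T>, Vec<T>), String> {
    // An odd number of arguments cannot be paired into keys and values.
    if args.len() % 2 != 0 {
        return Err(format!(
            "expected an even number of arguments, got {}",
            args.len()
        ));
    }
    // `split_off` leaves the first half in `args` and returns the second half.
    let values = args.split_off(args.len() / 2);
    Ok((args, values))
}
```

One consequence of this layout is that the function has to reject odd-length input itself, a check the two-array form does not need at this stage.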
Dataframe API is something used for building Expr. Most of them are written as macros if they have a similar pattern; others are individual functions, like count_distinct.
The idea of map is similar to