-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Prototype implementing DataFusion functions / operators using arrow-udf
liibrary
#11413
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
Comments
take |
Sorry it takes longer than I expected to make this works end-to-end. I plan to make an ScalarUDF with From my perspective (feel free to correct me if I'm wrong), Good points:
Bad points:
Neural:
I'd think we could replace some string functions, that are not supported by datafusion/datafusion/physical-expr/src/expressions/binary.rs Lines 616 to 664 in f204869
An example would be // declare concat
#[function("concat(string, string) -> string")]
#[function("concat(largestring, largestring) -> largestring")]
fn concat(lhs: &str, rhs: &str) -> String {
format!("{}{}", lhs, rhs)
}
// reference concat
apply_udf(
&ColumnarValue::Array(left),
&ColumnarValue::Array(right),
&Field::new("", DataType::Utf8, true),
"concat",
) CC @alamb |
Btw... the
Will it be better to use RecordBatches instead of ColumnValues in PhysicalExpr evaluate function? It will provide finely integrations with arrow-rs eco system. |
One thing that ColumnarValue does well is represent single values efficiently (aka I don't see any fundamental reason we couldn't use RecordBatch if we figured out a better way to represent single row |
Thank you so much @xinlifoobar -- this is really helpful and a great analysis (I think the pros/cons you identified make a lot of sense to me) From what I can see, if we wanted to proceed with using Here are some additional discussions
I think this is part of the same concept as discussed on https://lists.apache.org/thread/x8wvlkfr0osl15o52rw85wom0p4v05x6 -- basically the arrow-udf library's scope is large enough to encompass things like a function registry that DataFusion already has
I do think being able to special case scalar value is a critical requirement for performance. I will post about your findings on the mailing lists and let's see what the authors of arrow-udf have to say |
FWIW this implementation of concat would likely perform pretty poorly compared to a hand written one as it will both create and allocate a new temporary // declare concat
#[function("concat(string, string) -> string")]
#[function("concat(largestring, largestring) -> largestring")]
fn concat(lhs: &str, rhs: &str) -> String {
format!("{}{}", lhs, rhs)
} |
In this case, a writer-style return value is supported to avoid the overhead. #[function("concat(string, string) -> string")]
#[function("concat(largestring, largestring) -> largestring")] // will be supported soon
fn concat(lhs: &str, rhs: &str, output: &mut impl std::fmt::Write) {
write!(output, "{}{}", lhs, rhs).unwrap();
} The same technique can be applied to // to be supported
#[function("concat(binary, binary) -> binary")]
#[function("concat(largebinary, largebinary) -> largebinary")]
fn concat(lhs: &[u8], rhs: &[u8], output: &mut impl std::io::Write) {
output.write_all(lhs).unwrap();
output.write_all(rhs).unwrap();
} |
This PR will add an additional meta parameter `visibility` to `arrow-udf`. I might want this to be added while working on apache/datafusion#11413. Sometimes it is better to reference the symbol directly instead of using the function registry. --------- Co-authored-by: Runji Wang <wangrunji0408@163.com>
To clarify, the arrow-udf project provides several things. (I just tried to explain it here arrow-udf/arrow-udf#55) For DataFusion, we are mainly interested in the |
I'm inspired by apache/datafusion#11413. It seems they just need the function macro, not like "UDF", which surprised me a little. --------- Signed-off-by: xxchan <xxchan22f@gmail.com>
the |
I understand we're not concerned whether I don't know how does |
The A new String object will be allocated if the #[function("concat(string, string) -> string")]
#[function("concat(largestring, largestring) -> largestring")]
fn concat(lhs: &str, rhs: &str, output: &mut impl std::fmt::Write) {
write!(output, "{}{}", lhs, rhs).unwrap();
} |
@wangrunji0408 thanks, that's valuable! I spend quite some time today benchmarking various hypothetical implementations where concatenation logic is maximally separated from columnar (scalar/array) handling. Here couple conclusions
Footnotes
|
BTW We are discussing something similar here: |
Thanks for pointing this out. |
That would make sense and was the rationale on the original inclusion from @JasonLi-cn (❤️ )
I think the other thing that is important is if the function produces null on a null input (which has some term but I can't remember now). With those two properties I do think you can skip most null handling. This PR from @kazuyukitanimura in arrow may also help improve the situation for normal arrow |
is this a null-call function? A null-call function is an SQL-invoked function that is defined to return the null value if any of its input argu- ments is the null value. A null-call function is an SQL-invoked function whose specifies “RETURNS NULL ON NULL INPUT”.
yes. many functions are "return null on null; return non-null on non-null". for these null handling can be externalized, and the function "business logic" abstracted. |
Indeed I was thinking of
Exactly! |
I think the prototype was done. Let's continue the discussion in #12635 |
Is your feature request related to a problem or challenge?
Related to the discussion on #11192 with @Xuanwo
RisingWave has a library for automatically creating vectorized implementations of functions (e.g. that operate on arrow arrays) from scalar implementations
The library is here: https://github.com/risingwavelabs/arrow-udf
A blog post describing it is here: https://risingwave.com/blog/simplifying-sql-function-implementation-with-rust-procedural-macro/
DataFusion uses macros to do something similar in binary.rs but they are pretty hard to read / understand in my opinon:
datafusion/datafusion/physical-expr/src/expressions/binary.rs
Lines 118 to 130 in 7a23ea9
One main benefit I can see to switching to https://github.com/risingwavelabs/arrow-udf is that we could then extend arrow-udf to support Dictionary and StringView and maybe other types to generate fast kernels for multiple different array layouts.
Describe the solution you'd like
I think it would be great if someone could evaluate the feasibility of using the macros in https://github.com/risingwavelabs/arrow-udf to implement Datafusion's operations (and maybe eventually functions etc)
Describe alternatives you've considered
I suggest a POC that picks one or two functions (maybe string equality or regexp_match or something) and tries to use
arrow-udf
s function macro instead.Here is an example of how to use it: https://docs.rs/arrow-udf/0.3.0/arrow_udf/
Additional context
No response
The text was updated successfully, but these errors were encountered: