[FEA] support ascii function #9585
ASCII is an interesting one. It returns the numeric value of the first character of a string, so it works for more than just ASCII-encoded values, but whatever. The Spark code is fairly simple, with only a few corner cases. The problem is on the CUDF side. Doing the substring is simple enough, but converting the remaining byte array to an int, meaning the equivalent of https://docs.oracle.com/javase/8/docs/api/java/lang/String.html#codePointAt-int-, is the hard part. There is code in CUDF that could do it, but it is not in a standalone kernel, and if we have to write a new kernel to support this, then we might as well write one that does everything we need, unless CUDF has a need for something similar.
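For concreteness, a minimal Java sketch of the semantics being described, assuming Spark maps an empty string to 0 (the helper name `ascii` is just for illustration):

```java
public class AsciiSemantics {
    // Spark's ascii(str) is effectively String.codePointAt(0),
    // assuming empty strings map to 0.
    static int ascii(String s) {
        return s.isEmpty() ? 0 : s.codePointAt(0);
    }

    public static void main(String[] args) {
        System.out.println(ascii("A"));  // 65    (plain ASCII)
        System.out.println(ascii("é"));  // 233   (Latin-1 Supplement)
        System.out.println(ascii("中")); // 20013 (well defined outside 0~255)
    }
}
```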
I guess we could do our own UTF-8 to codepoint conversion using bit shifts/casts/comparisons/etc. to make it work (see the sketch below), but that feels really, really slow and potentially error prone.
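A sketch of what that hand-rolled conversion might look like, decoding just the first code point from UTF-8 bytes; malformed-input handling is omitted, which is part of why it feels error prone:

```java
// Decode the first code point from a UTF-8 byte sequence using only
// bit shifts and masks. No validation of continuation bytes is done.
static int firstCodePoint(byte[] utf8) {
    int b0 = utf8[0] & 0xFF;
    if (b0 < 0x80) {            // 1 byte:  0xxxxxxx
        return b0;
    } else if (b0 < 0xE0) {     // 2 bytes: 110xxxxx 10xxxxxx
        return ((b0 & 0x1F) << 6) | (utf8[1] & 0x3F);
    } else if (b0 < 0xF0) {     // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return ((b0 & 0x0F) << 12) | ((utf8[1] & 0x3F) << 6) | (utf8[2] & 0x3F);
    } else {                    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return ((b0 & 0x07) << 18) | ((utf8[1] & 0x3F) << 12)
                | ((utf8[2] & 0x3F) << 6) | (utf8[3] & 0x3F);
    }
}
```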
Take a look at `cudf::strings::code_points`.
@davidwendt Thanks, I missed that one. So it looks like we just need to put in some cudf JNI APIs for the `code_points` API.
I think it needs a custom kernel to fully support it, but if we only want to support ASCII and Latin-1, we can support it quickly in the plugin. Databricks's doc says the result is undefined for characters outside those ranges.
In my tests, the behavior of Apache Spark and Databricks matched, and both match Java's `codePointAt`. Here is a workaround to support ascii for 0~255 only, without a new kernel (sketched below):
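A hypothetical per-row sketch of the idea (the actual plugin version would use columnar GPU operations rather than per-row Java): in UTF-8, code points 0~127 are a single byte, and 128~255 (Latin-1 Supplement) are two bytes whose lead byte is 0xC2 or 0xC3, so the value can be rebuilt from at most two bytes.

```java
// Hypothetical per-row version of the 0~255-only workaround.
static int asciiUpTo255(byte[] utf8) {
    if (utf8.length == 0) {
        return 0;                        // empty string -> 0, matching Spark
    }
    int b0 = utf8[0] & 0xFF;
    if (b0 < 0x80) {
        return b0;                       // plain ASCII
    }
    if (b0 == 0xC2 || b0 == 0xC3) {      // Latin-1 Supplement, 0x80-0xFF
        return ((b0 & 0x1F) << 6) | (utf8[1] & 0x3F);
    }
    // Characters above 255 are the open question in the rest of the thread.
    throw new IllegalArgumentException("first character is outside 0~255");
}
```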
@nvliyuan, is it good enough for customers to just support ASCII 0~255 (ASCII and Latin-1 Supplement) as a first step?
I am fine if we restrict the range of values supported, but it has to be off by default unless we can match the result in all cases. Databricks can say that some ranges are undefined behavior, but Apache Spark does not (https://spark.apache.org/docs/latest/api/sql/index.html#ascii), and neither does Java (https://docs.oracle.com/javase/8/docs/api/java/lang/String.html#codePointAt-int-). So are we saying that Java's `codePointAt` API has undefined behavior for any character not in the range specified by Databricks?
Confirmed with customers: ASCII 0~255 is good enough. For the others we need to fall back rather than crash.
I think the doc is to explain that the result of `ascii` is not meaningful for characters outside that range, not that `codePointAt` itself has undefined behavior.
I believe a GPU job should always return the same result as the CPU run; otherwise it should be regarded as a bug? @revans2
We cannot fall back for the others. We don't have enough information by the time we start planning to be able to fall back. If we knew what the characters were at planning time, we could just replace the expression with its result up front.
Thanks for the explanation, I will double-confirm and update.
@thirtiseven confirmed ASCII 0~255 is good enough, thx |
This implements JNI work for strings::code_points to expose the API to Java usage. It will be useful for NVIDIA/spark-rapids#9585 Authors: - Haoyang Li (https://github.com/thirtiseven) - Chong Gao (https://github.com/res-life) Approvers: - Jason Lowe (https://github.com/jlowe) - Nghia Truong (https://github.com/ttnghia) URL: #14533
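As a rough idea of how the new binding could be used from the plugin — a sketch under the assumption that the JNI method is exposed as `codePoints()` on the Java column classes (inferred from the PR title, not a verified signature). Note that `strings::code_points` flattens every character of every row into one INT32 column, so the strings would first be sliced to their first character, as discussed above:

```java
import ai.rapids.cudf.ColumnVector;

public class CodePointsSketch {
    public static void main(String[] args) {
        // Assumes each input row has already been sliced to its first
        // character, so each row contributes exactly one code point.
        // codePoints() is assumed from the PR title, not a verified API.
        try (ColumnVector firstChars = ColumnVector.fromStrings("A", "é", "中");
             ColumnVector codePoints = firstChars.codePoints()) {
            // codePoints is an INT32 column: [65, 233, 20013]
        }
    }
}
```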
I wish we could support the ascii function, e.g. something like `SELECT ascii('222')`, which returns 50.