-
Notifications
You must be signed in to change notification settings - Fork 245
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
Reimplement check for non-regexp strings using RegexParser #4681
Conversation
tests/src/test/scala/com/nvidia/spark/rapids/RegularExpressionParserSuite.scala
Outdated
Show resolved
Hide resolved
…ParserSuite.scala Co-authored-by: Nghia Truong <ttnghia@users.noreply.github.com>
tests/src/test/scala/com/nvidia/spark/rapids/RegularExpressionParserSuite.scala
Show resolved
Hide resolved
Signed-off-by: Andy Grove <andygrove@nvidia.com>
} | ||
|
||
test("detect non-regexp strings") { | ||
val strings = Seq("\\.", "A", ",", "\t", ":", "") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks wrong. For example if I did a split with "\\."
as the separator on "a.b.c"
I would expect an output of ["a", "b", "c"]
. But I cannot take "\\."
and pass it into a non-regexp string split, because it will try to look for a "\\"
character in front of the '.'
. Am I wrong/missing something? I think we could still support it, but we would have to transpile the regular expression into a constant string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I agree, this isn't what we want here. Moving this to draft for now and will rethink this approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated to treat any string containing escaped characters as a regular expression.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have a follow on issue to do the transpile that I suggested. From a performance and memory standpoint NOT doing a regular expression is going to be much faster than trying to do it.
I've filed #4685 for this |
build |
This PR reimplements the check used in GpuStringSplit to determine if a string is a regular expression or not. The previous approach had limitations and now that we have a regular expression parser we can improve the check so that there are fewer false positives.
This change is needed to support the work for #3542