Make regexp pattern [^a] consistent with Spark for multiline strings #4255

Merged
5 commits merged into NVIDIA:branch-22.02 from neg-class-newline
Dec 6, 2021

andygrove
Contributor

@andygrove andygrove commented Dec 1, 2021

Signed-off-by: Andy Grove <andygrove@nvidia.com>

Closes #4229

The following documentation from this PR explains the transpiler change that makes us consistent with CPU for patterns such as [^a].

// There are differences between cuDF and Java handling of newlines
// for negative character matches. The expression `[^a]` will match
// `\r` and `\n` in Java but not in cuDF, so we replace `[^a]` with
// `(?:[\r\n]|[^a])`. We also have to take into account whether any
// newline characters are included in the character range.
//
// Examples:
//
// `[^a]`     => `(?:[\r\n]|[^a])`
// `[^a\r]`   => `(?:[\n]|[^a])`
// `[^a\n]`   => `(?:[\r]|[^a])`
// `[^a\r\n]` => `[^a]`
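
The four examples above follow one rule: whichever of `\r` and `\n` the negated class does not exclude goes into a separate alternative, and the negated class keeps the remaining characters. The Java sketch below illustrates that rule only; `NegatedClassRewrite` and `rewrite` are hypothetical names, not the plugin's transpiler, and it assumes the class body is plain regex source in which newlines appear only as the escapes `\r` and `\n`.

```java
import java.util.regex.Pattern;

public class NegatedClassRewrite {

    // Hypothetical helper, not the plugin's transpiler. `body` is the regex
    // source between "[^" and "]", e.g. "a" or "a\\r" (backslash followed by r).
    // Assumption: newlines appear only as the escapes \r and \n; an escaped
    // backslash followed by 'r' or 'n' would be misread by this sketch.
    static String rewrite(String body) {
        boolean excludesCr = body.contains("\\r");
        boolean excludesLf = body.contains("\\n");

        // Strip the newline escapes: per the comment above, a negated class
        // in cuDF does not match \r or \n in the first place.
        String rest = body.replace("\\r", "").replace("\\n", "");
        if (rest.isEmpty()) {
            throw new IllegalArgumentException(
                "a class containing only newline escapes is not handled in this sketch");
        }
        String negated = "[^" + rest + "]";

        // Newline characters that the original Java pattern would still match.
        StringBuilder allowed = new StringBuilder();
        if (!excludesCr) allowed.append("\\r");
        if (!excludesLf) allowed.append("\\n");

        if (allowed.length() == 0) {
            return negated;
        }
        return "(?:[" + allowed + "]|" + negated + ")";
    }

    public static void main(String[] args) {
        // Java side of the mismatch: a negated class matches a line feed.
        System.out.println(Pattern.matches("[^a]", "\n")); // true

        System.out.println(rewrite("a"));          // (?:[\r\n]|[^a])
        System.out.println(rewrite("a\\r"));       // (?:[\n]|[^a])
        System.out.println(rewrite("a\\n"));       // (?:[\r]|[^a])
        System.out.println(rewrite("a\\r\\n"));    // [^a]
    }
}
```

The rewritten pattern is meant to be evaluated with cuDF semantics, so this sketch only reproduces the string transformation; it does not check the result against a GPU regex engine.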

…epect to newline characters

Signed-off-by: Andy Grove <andygrove@nvidia.com>
@andygrove andygrove added this to the Nov 30 - Dec 10 milestone Dec 1, 2021
@andygrove andygrove self-assigned this Dec 1, 2021
@sameerz sameerz added the task Work required that improves the product but is not user facing label Dec 1, 2021
@andygrove andygrove changed the title WIP: Make regexp pattern [^a] consistent with Spark for multiline strings Make regexp pattern [^a] consistent with Spark for multiline strings Dec 3, 2021
@andygrove andygrove marked this pull request as ready for review December 3, 2021 20:59
@andygrove
Contributor Author

build

@jlowe
Contributor

jlowe commented Dec 6, 2021

build

@andygrove andygrove merged commit d3c5847 into NVIDIA:branch-22.02 Dec 6, 2021
@andygrove andygrove deleted the neg-class-newline branch December 6, 2021 18:29
Development

Successfully merging this pull request may close these issues.

[BUG] regexp_replace [^a] has different behavior between CPU and GPU for multiline strings
3 participants