[BUG] GpuRegExExtract is not align with RegExExtract #5135

Closed
sperlingxx opened this issue Apr 2, 2022 · 0 comments · Fixed by #5136
Describe the bug

  1. In Spark, the pattern passed to regexp_extract is not required to match the entire input string. As the name "extract" suggests, regexp_extract finds the first substring that matches the pattern and extracts the requested group from it. However, GpuRegExExtract requires the pattern to match the whole string.
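// Run in spark-shell, where spark.implicits._ (needed for toDF) is already in scope
import org.apache.spark.sql.functions.{col, regexp_extract}
val df = Seq("1a", "2a", "3a", "4a", "5a", "6a", "7a", "8a", "9a", "10a").toDF("c")
df.coalesce(1).select(regexp_extract(col("c"), "(a)", 1)).collect()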

GPU result: Array([], [], [], [], [], [], [], [], [], [])
CPU result: Array([a], [a], [a], [a], [a], [a], [a], [a], [a], [a])
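
The difference between the two results can be reproduced outside Spark with plain java.util.regex. The sketch below is only an illustration, not the plugin's or Spark's actual code: `extractFind` and `extractFullMatch` are hypothetical helper names, and the assumption is that the CPU path behaves like `Matcher.find()` while the buggy GPU path behaves like `Matcher.matches()`.

```scala
import java.util.regex.Pattern

object FindVsMatches {
  // Hypothetical helper: "find" semantics, i.e. the pattern may match
  // any substring of the input. Returns "" when nothing matches.
  def extractFind(input: String, regex: String, idx: Int): String = {
    val m = Pattern.compile(regex).matcher(input)
    if (m.find()) Option(m.group(idx)).getOrElse("") else ""
  }

  // Hypothetical helper: "full match" semantics, i.e. the pattern must
  // cover the whole input. Returns "" when it does not.
  def extractFullMatch(input: String, regex: String, idx: Int): String = {
    val m = Pattern.compile(regex).matcher(input)
    if (m.matches()) Option(m.group(idx)).getOrElse("") else ""
  }
}
```

With the input "1a" and the pattern "(a)", `extractFind` yields "a" (the CPU result above), while `extractFullMatch` yields an empty string (the GPU result above).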

  2. When the group index is 0, GpuRegExExtract doesn't behave correctly.

CPU run: regexp_extract('123abcEfg', '([0-9]+)[a-z]+([A-Z])', 0) => 123abcE
GPU run: regexp_extract('123abcEfg', '([0-9]+)[a-z]+([A-Z])', 0) => 123abcEfg
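
For reference, in java.util.regex group 0 denotes the text matched by the whole pattern, not the whole input string. A minimal sketch of the expected behaviour, assuming the CPU path follows `Matcher.find()` semantics (`GroupZero.groups` is a hypothetical helper written for this example):

```scala
import java.util.regex.Pattern

object GroupZero {
  // Hypothetical helper: returns groups 0..groupCount of the first match,
  // or an empty Seq when the pattern matches nowhere in the input.
  def groups(input: String, regex: String): Seq[String] = {
    val m = Pattern.compile(regex).matcher(input)
    if (m.find()) (0 to m.groupCount()).map(i => m.group(i)) else Seq.empty
  }
}
```

For '123abcEfg' and '([0-9]+)[a-z]+([A-Z])' this gives group 0 = "123abcE", group 1 = "123" and group 2 = "E". The trailing "fg" is not part of the match, so it should not appear in the group 0 result.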

BTW, this issue originated from #5088.

@sperlingxx added the "bug" (Something isn't working) and "? - Needs Triage" (Need team to review and classify) labels on Apr 2, 2022
@sperlingxx self-assigned this on Apr 2, 2022
@mattahrens removed the "? - Needs Triage" label on Apr 5, 2022