Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

[BUG] Regex: $ does not match \n at end of string #9620

Closed
andygrove opened this issue Nov 5, 2021 · 3 comments · Fixed by #9715
Closed

[BUG] Regex: $ does not match \n at end of string #9620

andygrove opened this issue Nov 5, 2021 · 3 comments · Fixed by #9715
Assignees
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python)

Comments

@andygrove
Copy link
Contributor

Describe the bug
There is an edge case with multiline / EOL matching that seems to be incorrect in cuDF.

Given the pattern 2$ I would expect it to match the inputs 2 and 2\n as seen in Python:

>>> print(re.compile('2$').search("2"))
<re.Match object; span=(0, 1), match='2'>

>>> print(re.compile('2$').search("2\n"))
<re.Match object; span=(0, 1), match='2'>

However, cuDF does not match in the 2\n case.

Steps/Code to reproduce bug

I don't know how to use cuDF from the Python repl with the latest code so I have not actually tested this, but this should be the repro case.

cudf.Series(['2\n']).str.contains('2$', regex=True)

Expected behavior
$ should match EOL even if the input ends with a line terminator.

Environment overview (please complete the following information)
N/A

Environment details
N/A

Additional context
None

@andygrove andygrove added bug Something isn't working Needs Triage Need team to review and classify labels Nov 5, 2021
@davidwendt davidwendt self-assigned this Nov 5, 2021
@davidwendt
Copy link
Contributor

The default behavior of contains changed in #9482 in regards to ^ and $ and the new-line character as explained in that PR. To make ^ and $ match with new-line characters instead of the whole string requires passing the MULTILINE flags parameter.

>>> import cudf
>>> import re
>>> cudf.Series(['2\n']).str.contains('2$', flags=re.MULTILINE, regex=True)
0    True
dtype: bool

The flags are defined here

enum regex_flags : uint32_t {
DEFAULT = 0, /// default
MULTILINE = 8, /// the '^' and '$' honor new-line characters
DOTALL = 16 /// the '.' matching includes new-line characters
};

@andygrove
Copy link
Contributor Author

andygrove commented Nov 8, 2021

I think I could have done a better job explaining this issue. What I am looking for is to have the default cuDF behavior (MULTILINE disabled) match Python's default behavior (also MULTILINE disabled), as described here. The key point is around how $ behaves when the string ends with \n:

By default, '^' matches only at the beginning of the string, and '$' only at the end of the string and immediately before the newline (if any) at the end of the string.

Does that make sense?

@davidwendt
Copy link
Contributor

Ok, got it, thanks. This is some wacky, yet documented edge case.

... By default, '^' matches only at the beginning of the string, and '$' only at the end of the string and immediately before the newline (if any) at the end of the string.

@beckernick beckernick added libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python) and removed Needs Triage Need team to review and classify labels Nov 12, 2021
rapids-bot bot pushed a commit that referenced this issue Nov 19, 2021
…#9715)

Closes #9620 

Fixes an edge case described in https://docs.python.org/3/library/re.html#re.MULTILINE
where the '$' EOL regex pattern character (without `MULTILINE` set) should match at the very end of a string and also just before the end of the string if the end of that string contains a new-line.

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Christopher Harris (https://github.com/cwharris)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Sheilah Kirui (https://github.com/skirui-source)

URL: #9715
rapids-bot bot pushed a commit that referenced this issue Dec 10, 2021
Closes #9764 

This reverts the change made in #9620 for the reasons given in #9764 (comment)

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Mike Wilson (https://github.com/hyperbolic2346)
  - Andy Grove (https://github.com/andygrove)

URL: #9774
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants