
[BUG] Regex \n at end of string change in issue 9620 broken string to timestamp behavior #9764

Closed
tgravescs opened this issue Nov 23, 2021 · 8 comments · Fixed by #9774
Assignees
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS strings strings issues (C++ and Python)

Comments

@tgravescs
Contributor

Describe the bug
The code change in #9620 unfortunately affected the behavior for Spark converting string to timestamp.

The previous behavior, when a newline was at the end of the string and the format was "yyyy-MM-dd", was to return null; now it returns the value with the newline dropped. The Spark CPU side returns null, so the previous behavior is what we expect.

Current broken behavior: (note the string here is actually "1999-12-29\n" and the second column is what is returned converting the string to a timestamp in the format "yyyy-MM-dd")

[#40] CPU: [1999-12-29
,null]
[#40] GPU: [1999-12-29
,1999-12-29]

I'm assuming that change touched a common code path between regex and string-to-timestamp.

Steps/Code to reproduce bug
Follow this guide http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports to craft a minimal bug report. This helps us reproduce the issue you're having and resolve the issue more quickly.

Expected behavior
GPU should return null when string-to-timestamp is called with "1999-12-29\n" and format "yyyy-MM-dd".
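As an analogy for the expected strict parsing semantics (this is not the cuDF code path, just an illustrative sketch), Python's datetime.strptime likewise rejects a trailing newline rather than silently dropping it:

```python
from datetime import datetime

def parse_or_none(s: str, fmt: str = "%Y-%m-%d"):
    """Return a datetime for s, or None if s does not strictly match fmt."""
    try:
        return datetime.strptime(s, fmt)
    except ValueError:
        # strptime raises "unconverted data remains" for the trailing "\n"
        return None

print(parse_or_none("1999-12-29"))    # parses successfully
print(parse_or_none("1999-12-29\n"))  # None
```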

I reverted the commit from the PR and tested it; it returns the expected value on the GPU side:

scala> res2.show()
+------------+----+
| c0| c1|
+------------+----+
|1999-12-29\n|null|
+------------+----+

Environment overview (please complete the following information)

  • Environment location: [Bare-metal, Docker, Cloud(specify cloud provider)]
  • Method of cuDF install: [conda, Docker, or from source]
    • If method of install is [Docker], provide docker pull & docker run commands used

Environment details
Please run and paste the output of the cudf/print_env.sh script here, to gather any other relevant environment details

Additional context
Add any other context about the problem here.

@tgravescs tgravescs added bug Something isn't working Needs Triage Need team to review and classify libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS strings strings issues (C++ and Python) labels Nov 23, 2021
@tgravescs
Contributor Author

@davidwendt @andygrove since they were involved previously.

@davidwendt
Contributor

I'd prefer to revert the change in libcudf.
Doing some more research, I found that this edge case does not appear in C++ std::regex:

#include <iostream>
#include <string>
#include <regex>
int main() {
    std::string s2 = "abc\n";
    std::regex self_regex("abc$");
    std::cout << std::boolalpha << std::regex_search(s2, self_regex) << "\n";
    return 0;
}

or in boost regex:

#include <boost/regex.hpp>
#include <string>
#include <iostream>
int main() {
  std::string s2 = "abc\n";
  boost::regex expr{"abc$"};
  std::cout << std::boolalpha << boost::regex_match(s2, expr) << '\n';
  return 0;
}

Both of these print false.
I'd rather libcudf be aligned with these libraries than with Python.
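For reference, Python's re module does report a match in this case, because Python's $ (without re.MULTILINE) also asserts just before a string-terminating newline; a minimal sketch:

```python
import re

# Python's "$" asserts at end of string OR just before a trailing "\n"
print(re.search(r"abc$", "abc\n") is not None)   # True
print(re.search(r"abc$", "abc") is not None)     # True

# \Z asserts only at the very end of the string
print(re.search(r"abc\Z", "abc\n") is not None)  # False
```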

I also tried it in Java:

import java.util.regex.*;

public class Test1 {
  public static void main(String[] args) {
    // matches() attempts to match the entire input, so the trailing "\n" fails
    System.out.println(Pattern.compile("abc$").matcher("abc\n").matches());
  }
}

This prints false as well.

@andygrove
Contributor

The Java example in the previous comment uses the method Matcher.matches, which "Attempts to match the entire region against the pattern" and has different semantics from the Matcher.find method that Spark uses in its RLike expression.

This code sample demonstrates the behavior that led to the initial request to change cuDF's handling here. I'm not sure what the next step should be with this yet but just wanted to document this for completeness.

scala> import java.util.regex._
import java.util.regex._

scala> val p = Pattern.compile("abc$")
p: java.util.regex.Pattern = abc$

scala> p.matcher("abc\n").find(0)
res4: Boolean = true
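A rough Python analog of this distinction, assuming re.search approximates Matcher.find and re.fullmatch approximates Matcher.matches:

```python
import re

pattern = re.compile(r"abc$")

# search scans for the pattern anywhere, like Matcher.find
print(pattern.search("abc\n") is not None)     # True

# fullmatch must consume the entire string, like Matcher.matches;
# "abc$" matches only the first three characters, leaving "\n" unconsumed
print(pattern.fullmatch("abc\n") is not None)  # False
```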

@davidwendt
Contributor

Thanks Andy. Can we revisit the original request?
I feel the breakage described in this issue would be more difficult to work around than the original issue.

@andygrove
Contributor

@davidwendt Yes, that makes sense.

@davidwendt davidwendt self-assigned this Dec 7, 2021
@andygrove
Contributor

This issue is now resolved. We implemented logic in the plugin regexp transpiler to account for the differences in newline handling between Java/Spark and cuDF, resolving the original issue that prompted the #9620 change, which has since been reverted.
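As a rough illustration only (not the actual spark-rapids transpiler, and the helper name rewrite_dollar is hypothetical), such a rewrite might let $ also consume an optional trailing newline before the pattern reaches cuDF:

```python
def rewrite_dollar(pattern: str) -> str:
    """Naive sketch: rewrite '$' so it also consumes an optional trailing
    newline, approximating Java's Matcher.find semantics on a regex engine
    where '$' matches only at the very end of the string.
    NOTE: this ignores escaped \\$ and character classes; illustration only."""
    return pattern.replace("$", r"\n?$")

print(rewrite_dollar("abc$"))  # abc\n?$
```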

@davidwendt
Contributor

What does this mean for #9774 ?
I'd still like to merge this. Would this change break you?

@andygrove
Contributor

Apologies, I thought that had already been merged. We would also like to see this merged.

@andygrove andygrove reopened this Dec 10, 2021
rapids-bot bot pushed a commit that referenced this issue Dec 10, 2021
Closes #9764 

This reverts the change made in #9620 for the reasons given in #9764 (comment)

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Mike Wilson (https://github.com/hyperbolic2346)
  - Andy Grove (https://github.com/andygrove)

URL: #9774
@bdice bdice removed the Needs Triage Need team to review and classify label Mar 4, 2024