[BUG] Regex \n at end of string change in issue 9620 broke string to timestamp behavior #9764
Comments
Tagging @davidwendt @andygrove since they were involved previously.
I'd prefer to revert the change in libcudf.
or in boost regex:
Both of these print the same result. I also tried it in Java:
This prints:
The Java example in the previous comment uses the method Matcher.matches, which "Attempts to match the entire region against the pattern" and has different semantics from the Matcher.find method that Spark uses. This code sample demonstrates the behavior that led to the initial request to change cuDF's handling here. I'm not sure what the next step should be with this yet, but I just wanted to document this for completeness.

scala> import java.util.regex._
import java.util.regex._
scala> val p = Pattern.compile("abc$")
p: java.util.regex.Pattern = abc$
scala> p.matcher("abc\n").find(0)
res4: Boolean = true
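For comparison, calling Matcher.matches on the same pattern p and input fails, because matches must consume the entire region, including the trailing newline (a hedged sketch; output assumes default Pattern flags):

scala> p.matcher("abc\n").matches()
res5: Boolean = false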
Thanks Andy. Can we revisit the original request?
@davidwendt Yes, that makes sense.
This issue is now resolved. We implemented logic in the plugin regexp transpiler to account for differences in newline behavior between Java/Spark and cuDF, resolving the original issue that prompted the #9620 change, which has since been reverted.
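For illustration only, here is a toy Scala sketch of the kind of rewrite described; this is not the actual spark-rapids transpiler code, and the function name and rewritten pattern are assumptions:

// Hypothetical sketch, not the real plugin transpiler: rewrite a trailing '$'
// so the pattern handed to cuDF explicitly allows an optional newline before
// end-of-string, mimicking Java's find() semantics where '$' can match just
// before a final '\n'.
def rewriteTrailingDollar(javaPattern: String): String = {
  if (javaPattern.endsWith("$") && !javaPattern.endsWith("\\$")) {
    javaPattern.dropRight(1) + "\\n?$" // assumed cuDF-compatible equivalent
  } else {
    javaPattern
  }
}

// Example: rewriteTrailingDollar("abc$") returns "abc\n?$"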
What does this mean for #9774?
Apologies, I thought that had already been merged. We would also like to see this merged.
Closes #9764

This reverts the change made in #9620 for the reasons given in #9764 (comment)

Authors:
- David Wendt (https://github.com/davidwendt)

Approvers:
- Vyas Ramasubramani (https://github.com/vyasr)
- Mike Wilson (https://github.com/hyperbolic2346)
- Andy Grove (https://github.com/andygrove)

URL: #9774
Describe the bug
The code change in #9620 unfortunately affected the behavior when Spark converts strings to timestamps.
Previously, when a newline was at the end of the string and the format was "yyyy-MM-dd", the conversion returned null; now it returns the value with the newline dropped. The Spark CPU side returns null, so the previous behavior is what we expect.
Current broken behavior (note the string here is actually "1999-12-29\n" and the second column is what is returned when converting the string to a timestamp with the format "yyyy-MM-dd"):
[#40] CPU: [1999-12-29
,null]
[#40] GPU: [1999-12-29
,1999-12-29]
I'm assuming that change touched a common code path shared between regex and string to timestamp.
Steps/Code to reproduce bug
Follow this guide http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports to craft a minimal bug report. This helps us reproduce the issue you're having and resolve the issue more quickly.
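A minimal sketch of the reproduction described above (hypothetical; assumes a spark-shell session with the RAPIDS Accelerator plugin enabled so the conversion runs on the GPU, and the names df and res2 are assumed):

import org.apache.spark.sql.functions._
import spark.implicits._ // spark-shell provides `spark`

val df = Seq("1999-12-29\n").toDF("c0")
val res2 = df.withColumn("c1", to_timestamp($"c0", "yyyy-MM-dd"))
res2.show()
// Expected (CPU, and GPU after the revert): c1 is null because of the trailing newline.
// Broken GPU behavior after #9620: c1 contains the date with the newline ignored.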
Expected behavior
The GPU should return null when string to timestamp is called with "1999-12-29\n" and the format "yyyy-MM-dd".
I reverted the commit from the PR and tested it, and it does return the expected value on the GPU side:
scala> res2.show()
+------------+----+
| c0| c1|
+------------+----+
|1999-12-29\n|null|
+------------+----+
Environment overview (please complete the following information)
docker pull & docker run commands used

Environment details
Please run and paste the output of the cudf/print_env.sh script here, to gather any other relevant environment details.

Additional context
Add any other context about the problem here.