Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

[BUG] strings::from_timestamp can overflow on valid timestamps #9790

Closed
revans2 opened this issue Nov 29, 2021 · 4 comments · Fixed by #9793
Closed

[BUG] strings::from_timestamp can overflow on valid timestamps #9790

revans2 opened this issue Nov 29, 2021 · 4 comments · Fixed by #9793
Assignees
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python)

Comments

@revans2
Copy link
Contributor

revans2 commented Nov 29, 2021

Describe the bug
When I try to convert some microsecond values to strings. scale_time appears to overflow and I get the wrong answer out.

Steps/Code to reproduce bug
This is Spark code, but I will try to get a repro case in C++ shortly.

spark.range(128849018879999995L, 128849018880000003L).selectExpr("*", "timestamp_micros(id) as ts").repartition(1).selectExpr("*", "minute(ts)").show(truncate = false)

+------------------+--------------------------+----------+
|id                |ts                        |minute(ts)|
+------------------+--------------------------+----------+
|128849018879999995|6053-01-23 02:07:59.999995|7         |
|128849018879999996|6053-01-23 02:07:59.999996|7         |
|128849018879999997|6053-01-23 02:07:59.999997|7         |
|128849018879999998|6053-01-23 02:07:59.999998|7         |
|128849018879999999|6053-01-23 02:07:59.999999|7         |
|128849018880000000|6053-01-23 02:52:00       |8         |
|128849018880000001|6053-01-23 02:52:00.000001|8         |
|128849018880000002|6053-01-23 02:52:00.000002|8         |
+------------------+--------------------------+----------+

Essentially when we get to 128849018880000000 which is the micro second timestamp value. get_time_components is called to find the hour, min, sec, and sub-sec parts of the timestamp. If it is micro-seconds, then the micro-second part is calculate and the timestamp is converted to seconds.
128849018880000000L / 1000000 => 21474836480

Then scale_time(tstamp, 60) is called, to convert the timestamp into mins instead of seconds. But 21474836480/60 is 2147483648, and scale_time returns an int32_t, so it overflows (Int.MaxValue is 2147483647).

Expected behavior
We get the correct answer back.

@revans2 revans2 added bug Something isn't working Needs Triage Need team to review and classify labels Nov 29, 2021
@davidwendt davidwendt self-assigned this Nov 29, 2021
@revans2
Copy link
Contributor Author

revans2 commented Nov 29, 2021

diff --git a/cpp/src/strings/convert/convert_datetime.cu b/cpp/src/strings/convert/convert_datetime.cu
index 51a6a796ba..8d0c5704a7 100644
--- a/cpp/src/strings/convert/convert_datetime.cu
+++ b/cpp/src/strings/convert/convert_datetime.cu
@@ -707,9 +707,9 @@ struct from_timestamp_base {
    *     scale( 61,60) ->  1
    * @endcode
    */
-  __device__ int32_t scale_time(int64_t time, int64_t base) const
+  __device__ int64_t scale_time(int64_t time, int64_t base) const
   {
-    return static_cast<int32_t>((time - ((time < 0) * (base - 1L))) / base);
+    return (time - ((time < 0) * (base - 1L))) / base;
   };
 
   __device__ time_components get_time_components(int64_t tstamp) const

Appears to fix the issue. I'll try to turn it into a PR with some tests.

@revans2
Copy link
Contributor Author

revans2 commented Nov 29, 2021

@davidwendt sorry feel free to take this if you want.

@davidwendt
Copy link
Contributor

@revans2 This change you proposed here looks good. I thought it might be a more involved change but this looks correct. Feel free to post a PR with this change.

@davidwendt davidwendt assigned revans2 and unassigned davidwendt Nov 29, 2021
@davidwendt davidwendt added libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python) labels Nov 29, 2021
@revans2
Copy link
Contributor Author

revans2 commented Nov 29, 2021

Will do

@rapids-bot rapids-bot bot closed this as completed in #9793 Dec 1, 2021
rapids-bot bot pushed a commit that referenced this issue Dec 1, 2021
This fixes #9790 

When converting a timestamp to a String it is possible for the %M min calculation to overflow an int32_t part way through casting. This moves that result to be an int64_t which avoids the overflow issues.

Authors:
  - Robert (Bobby) Evans (https://github.com/revans2)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - David Wendt (https://github.com/davidwendt)

URL: #9793
@bdice bdice removed the Needs Triage Need team to review and classify label Mar 4, 2024
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants