support to_timestamp with optional chrono formats #8886
Thanks @Omega359, I'll go through it tomorrow if no one else beats me to it
# verify timestamp data with formatting options
query PPPPPP
SELECT to_timestamp(null, '%+'), to_timestamp(0, '%s'), to_timestamp(1926632005, '%s'), to_timestamp(1, '%+', '%s'), to_timestamp(-1, '%c', '%+', '%s'), to_timestamp(0-1, '%c', '%+', '%s')
@alamb as you are the ticket initiator, is this a valid query? I checked https://www.postgresql.org/docs/current/functions-formatting.html and I don't see examples like that
These queries are just mirrors of the existing tests that were there for to_timestamp without formatting options.
I think the key observation here is that this PR implements different semantics than any existing to_timestamp
(it isn't postgres format strings, nor is it spark format strings, it is something datafusion specific based on the rust chrono format strings)
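To make the DataFusion-specific semantics concrete: the test query passes several chrono format strings to one call, and my understanding is that they are tried in order, with the first successful parse winning. Below is a minimal std-only sketch of that fallback behaviour. It does not use chrono or DataFusion; `parse_with_format` and `to_timestamp_seconds` are hypothetical names, and only the two formats `%s` and `%Y-%m-%d` are modelled.

```rust
/// Days since 1970-01-01 for a proleptic Gregorian date
/// (the usual "days from civil" arithmetic, with March as month 0).
fn days_from_civil(y: i64, m: i64, d: i64) -> i64 {
    let y = if m <= 2 { y - 1 } else { y };
    let era = y.div_euclid(400);
    let yoe = y.rem_euclid(400);
    let mp = (m + 9) % 12;
    let doy = (153 * mp + 2) / 5 + d - 1;
    let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;
    era * 146_097 + doe - 719_468
}

/// Hypothetical stand-in for a chrono-style parser. Only "%s" (epoch
/// seconds) and "%Y-%m-%d" (a date at midnight UTC) are understood;
/// any other format string yields None.
fn parse_with_format(input: &str, format: &str) -> Option<i64> {
    match format {
        "%s" => input.parse::<i64>().ok(),
        "%Y-%m-%d" => {
            let mut parts = input.splitn(3, '-');
            let y: i64 = parts.next()?.parse().ok()?;
            let m: i64 = parts.next()?.parse().ok()?;
            let d: i64 = parts.next()?.parse().ok()?;
            // Coarse range check only; days are not validated per month.
            if !(1..=12).contains(&m) || !(1..=31).contains(&d) {
                return None;
            }
            Some(days_from_civil(y, m, d) * 86_400)
        }
        _ => None,
    }
}

/// Try each format in the order given; the first one that parses wins.
fn to_timestamp_seconds(input: &str, formats: &[&str]) -> Option<i64> {
    formats.iter().find_map(|f| parse_with_format(input, f))
}
```

For instance, `to_timestamp_seconds("2024-01-18", &["%s", "%Y-%m-%d"])` fails on `%s` and falls through to the date format, while an input that matches no format yields `None`.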
Thank you @Omega359 -- I started going through this PR and it looks quite nice 👌
I haven't completed my review of datafusion/physical-expr/src/datetime_expressions.rs
yet, but I plan to do so later today or early tomorrow.
cc @waitingkuo @jhorstmann and @waynexia / @gruuya for your comments
use datafusion::prelude::*;
use datafusion_common::assert_contains;

/// This example demonstrates how to use the to_timestamp function in the DataFrame API as well as via sql.
This is really nice 👌
This is great -- thank you @Omega359 🙏
I think we just need some additional test coverage and this PR is ready to go from my perspective
This looks extremely useful, thanks!
Thanks @alamb I would probably stick to PG standard as we declared. The PR implementation is really cool, and we may want to have a separate function for it? like from_chrono or something
I had a thought for the future to allow for setting the format based on either config or based on dialect. I was thinking that might be best left until after the function refactor (#8045) is done, then have separate implementations if desired. I just didn't want to write a parser for the PG syntax as it's pretty detailed, and I just wanted to get something out there. Someone else noted the complexity of implementing the PG syntax. Do you happen to have such a parser available?
… without valid formats.
- Extracted out to_timestamp_impl method to reduce code duplication as per PR feedback.
- Extracted out validate_to_timestamp_data_types to reduce code duplication as per PR feedback.
- Added additional tests for argument validation and invalid arguments.
- Removed unnecessary shim function 'string_to_timestamp_nanos_with_format_shim'
I'm not sure if it's available, but we could probably use your approach to create an adapter to PG formatting? 🤔 So you map PG formatting rules into chrono formatting rules
It's possible to directly map most of the PG patterns to chrono patterns, but not all. It's a pretty deep rabbit hole: PG supports Julian Day, which chrono doesn't. PG also supports Era (AD/BC), whereas chrono only supports +/- on year. Then there are escape characters, of which backslash is allowed (but now discouraged, see <https://www.postgresql.org/docs/16/sql-syntax-lexical.html#SQL-SYNTAX-CONSTANTS>, §4.1.2.2); you can define your own in Unicode strings, and we would possibly have to handle all of the C-style escape sequences in the format. And heaven forbid you have to parse or display any day/month text in anything but English. Want day 1 of the week to be Sunday in chrono (numerous Muslim countries, Israel)? Nope, you get day 0-6 instead (PG does 1-7 correctly).
Or we just keep it simple for now :)
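A rough sketch of what such an adapter could look like for the patterns that do map directly. This is an illustration under assumptions, not code from the PR: `pg_to_chrono` and both tables are hypothetical, only a handful of patterns are covered, and PG quoting and escape handling (the hard part described above) are ignored entirely.

```rust
/// A few PostgreSQL template patterns paired with their chrono specifiers.
/// Four-character patterns are listed first so they are matched before
/// any shorter pattern could consume part of them.
const PG_TO_CHRONO: &[(&str, &str)] = &[
    ("YYYY", "%Y"),
    ("HH24", "%H"),
    ("HH12", "%I"),
    ("MM", "%m"),
    ("DD", "%d"),
    ("MI", "%M"),
    ("SS", "%S"),
];

/// Patterns mentioned in the thread as lacking a direct chrono
/// equivalent (Julian Day and era markers).
const UNSUPPORTED: &[&str] = &["J", "BC", "AD"];

/// Translate a PG-style template into a chrono-style format string,
/// or report the first pattern that cannot be mapped.
fn pg_to_chrono(pg: &str) -> Result<String, String> {
    let mut out = String::new();
    let mut rest = pg;
    'outer: while !rest.is_empty() {
        for (pg_pat, chrono_pat) in PG_TO_CHRONO {
            if let Some(tail) = rest.strip_prefix(*pg_pat) {
                out.push_str(*chrono_pat);
                rest = tail;
                continue 'outer;
            }
        }
        for pat in UNSUPPORTED {
            if rest.starts_with(*pat) {
                return Err(format!("no chrono equivalent for PG pattern {}", pat));
            }
        }
        // Everything else is copied through as literal text.
        // A literal '%' has to be doubled for chrono.
        let ch = rest.chars().next().unwrap();
        if ch == '%' {
            out.push_str("%%");
        } else {
            out.push(ch);
        }
        rest = &rest[ch.len_utf8()..];
    }
    Ok(out)
}
```

With this sketch, `pg_to_chrono("YYYY-MM-DD HH24:MI:SS")` produces `%Y-%m-%d %H:%M:%S`, while a template containing `J` or `BC` is rejected instead of being silently mistranslated.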
I'll leave the final decision to @alamb although he already approved.
…_formats # Conflicts: # datafusion/proto/src/logical_plan/from_proto.rs
I took the liberty of merging up from main and running prettier to get CI clean.
Thank you again @Omega359 @comphead and @gruuya
I agree in an ideal world,
I agree this would be really nice to add. All in all, I think this functionality is important enough to give users ways to parse timestamps with alternate formatting, even if the supported format strings aren't in the ideal form.
I did add a fairly extensive SQL example; however, unless a user knows where to look they won't see it. Speaking from experience, adding more extensive documentation for functions would be very helpful for someone starting out with datafusion (& rust). I'll be happy to update the docs with a brief example and a link to the example SQL. Perhaps that could be a pattern for all functions?
That would be amazing @Omega359 -- and really helpful. I have found that when there are good existing patterns subsequent contributors follow them, but getting the initial pattern set is the tough part.
@Omega359 Do you know why In Postgres, there are two arguments
Can I change the signature to be consistent with the Postgres one? datafusion/datafusion/sqllogictest/test_files/timestamps.slt Lines 362 to 366 in 7ebd993
Tracked in #13351
https://datafusion.apache.org/user-guide/sql/scalar_functions.html#to-timestamp I would be very much opposed to changing this to be exactly like pg. |
Which issue does this PR close?
Closes #5398.
Rationale for this change
Adding flexibility for parsing timestamp strings
What changes are included in this PR?
Code, tests and user guide updates.
Are these changes tested?
Tests were added and pass.
Are there any user-facing changes?
dataframe.to_timestamp(..) changed from
to_timestamp(col("a"))
to
to_timestamp(vec![col("a")])