Add initial spec for handling dates and times #35
Conversation
Everything seems sound. I've no comments.
Quick question for the front end devs: Right now, I have the spec specifying using 'AD' or 'BC' to denote the era in all cases for consistency. However, it's often omitted for recent dates (e.g., the last few hundred years or so). My thinking is that the API should be consistent there, and the UI can omit the era information if it wants. Does this work for you all?

Edit: Also, I stuck the era at the end of both date and date/time combo strings. My thinking on that is that this will be easiest to strip off if it's not wanted. On the other hand, it's slightly inconsistent, since it means the string from the
1. Users should be able to interact with dates, times, datetimes, and time durations in Mathesar in the most human-friendly way possible
2. The UI should be extremely forgiving w.r.t. date/time input, and also display dates and times in a human-friendly way.
3. The API should be extremely forgiving w.r.t. date/time input, but should return dates, times, and time durations in a more programmatic, i.e., consistent way than the UI.
4. The UI and API should be consistent in the following way: If the input string is understood by both UI and API, they should always result in the _same date/time_ being stored, regardless of the input method.
> If the input string is understood by both UI and API, they should always result in the same date/time being stored, regardless of the input method.
Agreed!
I want to highlight a related conversation in which I said:
> Some situations may have edge cases which create difficulty in making the client's results consistent with the server's results.

> checking if something is an ISO datetime shouldn't be that hard
I know we're not critiquing implementation yet, but I just want to offer a thought for people to consider early-on during implementation...
If we had a Node backend, reusing the same parsing code across client and server would be trivial, and I'd be inclined to push quite a bit of validation logic to the front end, for date/time and other stuff too. However, with our front end and back end using different languages, it's not trivial. I'm wary of implementing separate parsers (or even simple validators) in python and TypeScript, because as this document demonstrates, there are edge cases to consider.
The first thing I'd do here is to look for a library which implements a parser that we can run on both the client and server. I haven't spent time to research this, but here's an example of a small library I've played with before (not related to dates/times) that runs the same code in Python and a browser (via wasm). If such a library already exists for parsing ISO 8601 and/or RFC 3339, then I'd be inclined to pursue it a bit. If we did decide to use a 3rd-party library, then all these specs would become subject to its implementation details -- which is why I raise this point now, instead of during our own implementation phase.
@mathemancer, have you looked for any 3rd-party libs that already do this parsing?
If a suitable 3rd-party library doesn't already exist, then I'm a bit torn about how I'd want to proceed. Personally, I'd be quite interested in writing one in Rust, but that's likely more work than we want to bite off right now. I would accept an approach where we duplicate this logic in Python and TypeScript, if others are really set on it. But if everything were up to me, I'd probably begin by delegating full responsibility to the back-end for all the date/time parsing.
A couple of problems arise if we try to do parsing in Python. Notably, parsing into Python date/time types will break for:

- years with more than 4 digits
- years B.C.E.
- durations given with at least 30 days (Python's `timedelta` always assumes 30 days is a month)
- durations given with at least 24 hours (Python's `timedelta` always assumes 24 hours is a day; not true around DST changes)
A lot of what I'm doing right now is avoiding letting python have anything to do with date and/or time parsing.
So, I'm currently doing all parsing in PostgreSQL. It's more sophisticated, and also guarantees consistency with what's actually being stored for obvious reasons.
> A lot of what I'm doing right now is avoiding letting python have anything to do with date and/or time parsing
Ok I'm confused then. Can you help me understand? For example, if the user imports a CSV file containing a string that we'd like to parse into a date, where would the code exist that would do that parsing?
Arrow, a Python library for parsing datetimes, is quite similar to moment.js, so we could consider using it if necessary. We are already using it to validate timestamp-related display options. But I am not sure we should be parsing on the service layer, as most of the Python libraries would convert to a Python-based `timedelta`, which has issues with precision.
> I'm currently doing all parsing in PostgreSQL
Ok I didn't see this edit you made when I wrote my previous comment. Have you considered using something like plv8 and writing all the parsing logic in JavaScript? I've never done this but I've heard good things about it.
What language are you currently planning to use to write the parsing logic?
I don't think the UI should understand date/time while the user is entering it. The only purpose in this case would be to perform validations, and I think validations should always be done on the server for this.
- We intend to allow the user to enter any string which may be a date or time, e.g. 'yesterday', 'an hour ago'.
- We would probably want to extend these in the future (with more such common strings, or even additional languages).
- We can only do that without conflict if we do it in one place: either the server or the client, and it makes most sense to do it on the server.
After the user enters it, the UI will absolutely need to understand valid date/time strings. This would be possible as the server will always respond with a well-defined format as mentioned further down in the spec. This can be used for any operation the UI needs as well as for formatting the strings based on display options.
> What language are you currently planning to use to write the parsing logic?
We are not writing any parsing logic ourselves on the backend; PostgreSQL has excellent parsing logic built in.
> Ok I didn't see this edit you made when I wrote my previous comment. Have you considered using something like plv8 and writing all the parsing logic in JavaScript? I've never done this but I've heard good things about it.

Anything custom would need to be done in PL/pgSQL (or plain SQL). The reason is that we can't count on being able to install extensions. Some users may be using managed PostgreSQL instances (e.g., Amazon's RDS) where installing extensions isn't possible.
```python
f'{year}-{month}-{day} {era}'
```
Here, `year` will have a minimum of 4 digits, using leading zeroes for years like 0042 BC. Years with more than 4 digits have the correct number of digits. `month` is a 2-digit integer, as is `day` (with leading zeroes as needed). `era` will always be included, and will be either `BC` or `AD` as appropriate.
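To make the padding and era rules concrete, here is a small, purely illustrative Python sketch; it is not how Mathesar produces these strings (PostgreSQL does), and the function name is invented for the example:

```python
def format_canonical_date(year: int, month: int, day: int, era: str) -> str:
    # Illustration only: year padded to at least 4 digits, month/day to 2,
    # with the era ('BC' or 'AD') always appended at the end.
    return f'{year:04d}-{month:02d}-{day:02d} {era}'

assert format_canonical_date(42, 3, 15, 'BC') == '0042-03-15 BC'
assert format_canonical_date(2022, 1, 7, 'AD') == '2022-01-07 AD'
assert format_canonical_date(12345, 12, 31, 'AD') == '12345-12-31 AD'  # >4 digits preserved
```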
> Here, `year` will have a minimum of 4 digits

I'm happy to see that we'll be Y10k compliant 😁
#### Regarding relative times

There are some relative dates / times, e.g. "tomorrow", which make sense to store in a date column. These will _not_ be stored in a "relative" way, though. They'll be stored as a static date or time (one day later than the current day, in the example). As for durations, there's some weird overlap between an interval "2 days ago" and a specific time "2 days ago". We'll choose whether to store the duration or a static time based on the column type.
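As a hedged sketch of what choosing "based on the column type" could look like when delegating to PostgreSQL (the function, the psycopg2-style connection, and the casting approach are illustrative assumptions, not Mathesar's actual code):

```python
def parse_for_column(conn, raw: str, column_type: str) -> str:
    """Hand the raw user string to PostgreSQL and let the target column's
    type decide how it's interpreted. Casting the result back to text
    avoids pulling the value into Python date/time types."""
    # column_type comes from trusted column metadata, e.g. 'date' or 'interval'
    with conn.cursor() as cur:
        cur.execute(f"SELECT CAST(%s AS {column_type})::text", (raw,))
        return cur.fetchone()[0]

# 'tomorrow' in a date column is resolved to a static date at parse time:
#     parse_for_column(conn, 'tomorrow', 'date')
# '2 days ago' in an interval column is kept as a (negative) duration:
#     parse_for_column(conn, '2 days ago', 'interval')
```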
> We'll choose whether to store the duration or a static time based on the column type.
This is very cool!
I imagine we might eventually want to support date/time parsing for contexts in which we don't have a column type. For example, say I want to create a filter condition that selects all records with `date_added` within the past 2 days. In that case, I think we'd want to honor the fact that the user entered a duration. That ought to be obvious during implementation, but I just wanted to point this out for others to consider.
- `'1 year 1 month, 1 day 1 hour 1 minute 1.1 seconds'` maps to `P1Y1M1DT1H1M1.1S`
- `'0 years 13 months 31 days 23 hours 60 minutes 61.1 seconds'` maps to `P1Y1M31DT24H1M1.1S`
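For anyone who wants to check these mappings against PostgreSQL directly, here is a hedged sketch (not part of the spec; it assumes a psycopg2-style connection) that relies on PostgreSQL's `IntervalStyle` setting to emit intervals in ISO 8601 form:

```python
def to_iso8601_duration(conn, raw: str) -> str:
    """Parse a duration string with PostgreSQL and return its ISO 8601 form."""
    with conn.cursor() as cur:
        cur.execute("SET intervalstyle = 'iso_8601'")
        cur.execute("SELECT CAST(%s AS interval)::text", (raw,))
        return cur.fetchone()[0]

# Per the examples above, we'd expect:
#     to_iso8601_duration(conn, '1 year 1 month, 1 day 1 hour 1 minute 1.1 seconds')
#     == 'P1Y1M1DT1H1M1.1S'
```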
These are helpful examples of our expected input/output.
From reading these two examples, I infer:
- Commas do not matter
- Pluralization of words does not matter
Is this correct?
It might also be nice to strip the string `and` from the input, so that users can enter `1 year and 1 month`.
Yes, that's correct (at least for the DB-layer parsing). As for stripping "and", we might be able to do something for single inserts, but I don't think it would be very simple (with the default `INTERVAL` type) for `COPY`-based bulk imports. We could probably add something as a feature at some point, but the implementation would be quite difficult.
Sounds good!
Here's an additional edge case to consider too, which I don't fully understand: https://en.wikipedia.org/wiki/Year_zero#ISO_8601. Maybe not worth thinking about more, though I'm not sure.
Here's another thing I just realized might be missing from this spec (though perhaps the scope of the spec is narrower than I'm imagining). In contexts where we accept a date, the user should be able to enter a string like "1/7/2022", right? What about "6 Aug, 1870"? What about "7/2" (inferring the current year)? There are lots of variations here obviously, and a good deal of ambiguity too, which requires knowledge about locale to resolve. For example, "1/7/2022" might become "2022-01-07" in the USA but "2022-07-01" in many European countries. These situations are tricky. Should this spec deal with those situations too?
## Overall input flow

If the UI is able to parse a date, time, duration, etc., it should reformat it into the appropriate canonical form described below and send that as the input to be stored to the API. If it's not able to parse the input, it should send the raw input string to the API and let the service layer / database try to parse the string. This helps satisfy goal number 5. If the service layer is able to parse the string, then store it. Otherwise, raise a validation error.
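As a rough illustration of the data-layer half of this flow, here is a minimal sketch assuming psycopg2 as the driver (none of the names below come from the Mathesar codebase): the raw string is cast to the column's type, and a failed cast surfaces as a validation error.

```python
import psycopg2
from psycopg2 import sql

def store_raw_value(conn, table: str, column: str, column_type: str, raw: str):
    """Let PostgreSQL parse the raw string; translate a parse failure
    into a validation error for the API layer."""
    query = sql.SQL("INSERT INTO {table} ({column}) VALUES (CAST(%s AS {type_}))").format(
        table=sql.Identifier(table),
        column=sql.Identifier(column),
        type_=sql.SQL(column_type),  # trusted type name from column metadata
    )
    try:
        with conn.cursor() as cur:
            cur.execute(query, (raw,))
    except psycopg2.DataError as err:
        raise ValueError(f"Could not parse {raw!r} as {column_type}") from err
```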
> let the service layer / database try to parse the string. This helps satisfy goal number 5. If the service layer is able to parse the string, then store it. Otherwise, raise a validation error.

I am not sure we should attempt to do the parsing on the service layer, as most of the Python libraries would convert it to a Python-based `timedelta`, which has issues with precision.
By "service layer", I'm mostly meaning the database. Perhaps I should be clearer there. We're going to strictly avoid using any python date / time types.
@mathemancer Looks great overall. Two changes requested:
(1) I think we need documentation (or a link to documentation) about the formats that PostgreSQL supports while parsing dates, times, and durations so that we can make sure the frontend is consistent with it. Ideally, what frontend library we use should also be part of this spec but we can get to that once we have the requirements. It should also help resolve @seancolsen's question here.
(2) See my comment on line 21.
## Overall input flow

If the UI is able to parse a date, time, duration, etc., it should reformat it into the appropriate canonical form described below and send that as the input to be stored to the API. If it's not able to parse the input, it should send the raw input string to the API and let the service layer / database try to parse the string. This helps satisfy goal number 5. If the service layer is able to parse the string, then store it. Otherwise, raise a validation error.
I don't think the frontend should be compelled to send raw input to the API if it can't parse it; this seems unnecessary given goal 5. Someone might enter "pizza" in a datetime field; we don't need the backend to know that's not a datetime.
After reading @pavish's comment and thinking about it more, it seems like the frontend wants to rely entirely on the backend for validation. I think this is fine for now, which makes my comment above irrelevant.
I do think we should eventually look into whether we can do validation on the frontend (to make the UX snappier), but we can update this spec if/when it comes to that.
My current thinking is to do the "default" parsing in the DB (and therefore for the API), and then the front end could extend that bit by bit, adding validation iteratively. But if we assume that the DB will need to attempt parsing whatever it's handed from the beginning, it takes pressure off adding any intelligence in the front end for a while.
Sounds good.
2. The UI should be extremely forgiving w.r.t. date/time input, and also display dates and times in a human-friendly way.
3. The API should be extremely forgiving w.r.t. date/time input, but should return dates, times, and time durations in a more programmatic, i.e., consistent way than the UI.
4. The UI and API should be consistent in the following way: If the input string is understood by both UI and API, they should always result in the _same date/time_ being stored, regardless of the input method.
5. The dates/times understood by the API should be a subset of those understood by the UI. That is, there should be no dates/times/time durations that are understood by the API, but which can't be entered through the UI.
I don't think we need to restrict what is accepted by the API based on the UI understanding user input.

> understood by the UI

The UI only needs to understand dates & times after the user enters and saves them, in which case the backend will provide them in a well-defined format. See my comment above.

The UI needs to understand date and time, but it does not need to understand user input.
I could definitely be convinced to keep the UI layer pretty "thin" for this. That would certainly make keeping things consistent easier.
Ah, @pavish I just realized the source of confusion. By "understood by the UI", I don't mean it needs to be handled by code at the UI layer. I mean that, from a user perspective, the UI should accept all input that the API accepts; the UI shouldn't invalidate a string that the API could have accepted. I did want to leave the possibility open that the UI could have some fancy parsing that would allow accepting more input styles, and then parse them into RFC 3339 compliant strings to submit to the API.
I'll think about how to word that more clearly.
@mathemancer I have some concerns about goal 5 which I've mentioned in different comments. Apart from that, everything looks good to me.
These all parse correctly using the default PostgreSQL handling except "7/2", which is just too ambiguous. As far as the locale-sensitive ones, they'll be parsed according to the database locale (which is the
@mathemancer The fact that we're leaning on parsing functionality that's already built into Postgres is a pretty big piece of the picture which I completely failed to see upon first reading of the spec. Thanks for clarifying. With that additional knowledge, my opinion to delegate validation and parsing responsibility to the back-end is further strengthened. And it seems like we have a rough consensus on that now too.
I'm curious to play with this a bit.
No special SQL is necessary, you can insert the values as-is. See https://www.db-fiddle.com/f/gijuboYyi7k5fRX5ABnwba/0
I spent a little time playing with PostgreSQL's parsing using some ad-hoc queries. Comparing it to the date parsing in Google Sheets, I only found a few inconsistencies:

We'll need to be aware of these limitations and document them. For example, if a user exports a spreadsheet to CSV which contains the date
I don't think we need to do this – different software handles dates and times in different ways, and I don't think it's feasible for us to keep track of them all. We also don't necessarily want to be consistent, see Scientists rename human genes to stop Microsoft Excel from misreading them as dates. I think we should assume the user is responsible for the format they use when exporting their CSVs and not try to reconcile different parsers. Also, the default representation of dates in Google Sheets is fairly unambiguous. I tried formatting a column in Google Sheets with values
@kgodey Sorry to harp on this a bit. My opinions are diminishing in strength as we get further into the weeds, but I want to offer a few clarifying points. I'm not arguing for any modifications to our strategy. I'm just trying to share some information with others as food-for-thought while considering our date parsing.
No apologies are necessary! I'm glad that you clarified what you were saying.
You're correct, I somehow missed that they were stored as dates during my initial Google Sheets testing.
For imports, it does matter how our date parsing works, since we suggest a data type for a column based on whether we can parse it.
I thought you were suggesting that we do an exhaustive check of how our parsing differs from common spreadsheet tools and document every case; I don't think that's worth putting time into (and it doesn't seem like that's what you were actually suggesting).
Agreed. Mathesar won't let you set columns with that data to a date and that's good.
I do too. This is where a datetime parsing library on the frontend could come in handy: we could parse user input into ISO format datetimes. The tricky thing here would be to make sure that the frontend parsing is a superset of the backend parsing, or be okay with different parsing on the frontend/backend. I would like to explore frontend parsing after the alpha release but, as you've pointed out, it's good enough for now.

Agreed. At some point (well after alpha), we could even check with the user how they want us to process 2-digit dates when we detect them.
Good point. Fortunately strings like
I'd still rather handle this on the back end so that the parsing logic works identically for user input and imports. We could add a very thin layer of input cleaning before the string is sent to Postgres. Within that layer, we'd do some simple regex transformation using very strict patterns, and we'd only apply the transform when the column type is time-related:

```python
import re

def expand_shorthand_time(input: str) -> str:
    # Expand shorthand times like '3pm' into '3:00pm' so Postgres can parse them.
    return re.sub(re.compile(r'(?i)\b(1[12]|0?\d)(\s*[ap]m)\b'), r'\1:00\2', input)

def test_expand_shorthand_time():
    e = expand_shorthand_time
    # Matches
    assert e('3pm') == '3:00pm'
    assert e('11pm') == '11:00pm'
    assert e('12pm') == '12:00pm'
    assert e('1 AM') == '1:00 AM'
    assert e('07 am') == '07:00 am'
    assert e('from 11 am to 12 pm and beyond') == 'from 11:00 am to 12:00 pm and beyond'
    # Non-matches
    assert e('111 pm') == '111 pm'      # too many digits
    assert e('13 pm') == '13 pm'        # number > 12
    assert e('3 amigos') == '3 amigos'  # no word boundary
```
Good idea. We could also use some regex logic to detect these kinds of dates and issue a warning to the user explaining the situation.
Unfortunately we can't do this for imports under our current architecture. We don't read the CSV in Python at all; we just pass it directly to Postgres (which has built-in CSV import functionality). We could do it for individually entered data, either on the frontend or the backend. I think it would be okay for us to support more parsing formats on the frontend than during import; it's more important that people be able to add data easily/quickly when they're doing it manually.
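For context, here is a minimal sketch of the import path described above, assuming psycopg2 (the table name and COPY options are illustrative): the file is streamed straight into PostgreSQL's `COPY`, so PostgreSQL, not Python, parses any dates it contains.

```python
def import_csv(conn, table_name: str, csv_path: str):
    """Stream a CSV file directly into PostgreSQL via COPY ... FROM STDIN."""
    # table_name is trusted metadata here, not user input.
    copy_stmt = f"COPY {table_name} FROM STDIN WITH (FORMAT csv, HEADER true)"
    with conn.cursor() as cur, open(csv_path, 'r') as f:
        cur.copy_expert(copy_stmt, f)
```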
Ahh I see. That makes sense.
Clarify where parsing happens, add more examples
@mathemancer I'll review in more detail tomorrow (when I'm actually working), but I think it would make sense to document (probably in an appendix-type place) the parsing differences between Google Sheets (or spreadsheets in general) that @seancolsen found.
If we want to extend the set of understandable date/time input strings in the future, then the UI should parse the more exotic input strings into the appropriate canonical form described below and send that as the input to be stored to the API. Care should be taken to ensure that _if the UI parses the string_, then it's either
- a string which wouldn't be handled by the database parsing, or
- the parsing results in an identical value being stored as the one that would have resulted from sending the raw string to the API.

If the UI is not able to parse the input, it should still send the raw input string to the API and let the data layer try to parse the string. This helps satisfy goal number 5. If the data layer is able to parse the string, then store it. Otherwise, raise a validation error.
> the UI should parse the more exotic input strings into the appropriate canonical form
This statement is at odds with my understanding of the resolution of our discussion. With the "6 pm" case as an example, I don't want to fully parse that into canonical form. I'd rather only do the minimum amount of work necessary to convert it into a format that Postgres can understand -- in this case "6:00 pm" because that's much simpler and thus has less risk of violating goal 4.
I suggest replacing lines 23-27 with:
We have identified strings like "6 pm" as having a format which Postgres cannot parse but which we would like to accept as user input. There may be other such formats too. For the sake of task prioritization, we will not attempt to accept such strings before the alpha release. If we want to extend the set of understandable date/time input strings in the future then we will evaluate the best implementation on a case-by-case basis. For the "6 pm" format, we would likely implement a thin string transformation in the UI layer, converting the string to "6:00 pm" (without fully parsing it into canonical form). With this implementation, "6 pm" would only be accepted in the UI, and not through imports, which would be okay.
A complete rewrite of lines 23-27 seems unnecessary, I think the detail is useful.
I do agree that the UI shouldn't need to parse it into the canonical representation, just an API-understood representation.
I also think the "6 pm" example would be good to include in addition.
This is a good point. My intent with the "canonical representation" is more of a restriction on API output, and also a sort of "guaranteed input" that conforms to a well-known RFC, and by extension should be possible to achieve using libraries to avoid custom work in the UI layer. That being the case, I'll reword this to make the "canonical form" a non-restrictive option for input. I was mostly hoping to avoid the UI implementer needing to read any PostgreSQL docs to figure out what forms would be acceptable.
I feel quite confident that we're all on the same page about everything now, so I'm a little hesitant to continue with critique. For the sake of clarity in the spec, though, I made a small suggestion.
@mathemancer See comments below. I reiterate that I think it would make sense to document (probably in an appendix-type place) the parsing differences between Google Sheets (or spreadsheets in general) that @seancolsen found and posted in this comment.
Looks good overall, I don't need to re-review before merge.
This PR adds a spec to the wiki showing how we'll deal with dates, times, date/time combinations, and time durations.