fast_strptime() doesn't allow whitespace that base::strptime() allows #911

kenahoo · 2020-07-27T22:54:16Z

I wanted to switch some code at my company from base::strptime() to lubridate::fast_strptime(), because the %z in the latter understands ISO 8601 offsets like ±[hh]:[mm] and ±[hh][mm], whereas base::strptime()'s %z only understands the ±[hh][mm] format, without the colon.

However, this ended up causing problems, because base::strptime() allows varying whitespace in between the day and hour portion of the input, whereas lubridate::fast_strptime() does not.

lubridate::fast_strptime("8/1/2012  8:02:51.397000 AM", "%m/%d/%Y %I:%M:%OS %p")
#> [1] NA
lubridate::fast_strptime( "8/1/2012 8:02:51.397000 AM", "%m/%d/%Y %I:%M:%OS %p")
#> [1] "2012-08-01 08:02:51 UTC"
                strptime("8/1/2012  8:02:51.397000 AM", "%m/%d/%Y %I:%M:%OS %p")
#> [1] "2012-08-01 08:02:51 UTC"
                strptime( "8/1/2012 8:02:51.397000 AM", "%m/%d/%Y %I:%M:%OS %p")
#> [1] "2012-08-01 08:02:51 UTC"

Created on 2020-07-27 by the reprex package (v0.3.0)

The colon-including format is the default (only?) format used by Python's pandas.Timestamp.isoformat(), and I wanted to consume some data emitted by pandas, which is what led me to fast_strptime(). The stricter whitespace handling would likely break things elsewhere in our code, though, so I'm not sure I can switch.
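For anyone who has to stay on base::strptime() in the meantime, one possible workaround for the colon is to strip it from the offset before parsing. This is only a sketch: strip_offset_colon() is a hypothetical helper name (not part of base R or lubridate), and it assumes the offset is the last thing in the string.

```r
# Hypothetical helper: rewrite a trailing "+hh:mm" / "-hh:mm" offset
# as "+hhmm" / "-hhmm" so that base strptime()'s %z can read it.
strip_offset_colon <- function(x) {
  sub("([+-][0-9]{2}):([0-9]{2})$", "\\1\\2", x)
}

x <- strip_offset_colon("2019-10-24T00:00+13:00")
x
#> [1] "2019-10-24T00:00+1300"

# tz = "UTC" is set explicitly so the result does not depend on the local time zone
parsed <- strptime(x, "%Y-%m-%dT%H:%M%z", tz = "UTC")
parsed
#> [1] "2019-10-23 11:00:00 UTC"
```

Strings that already use the colon-free form pass through unchanged, so the helper can be applied to mixed input.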

It would be great if this small change in behavior between the two functions could be reconciled, or if not, at least noted in the docs as an intentional difference, for the wary.

kenahoo · 2020-07-27T23:15:19Z

Ooh, I get the privilege of submitting bug #911. ;-)

vspinu · 2020-07-28T18:25:54Z

I will have a look into fixing this. For now you can use parse_date_time2 which is almost equivalent to fast_strptime except it skips separators (not only spaces).

> lubridate::parse_date_time2("8/1/2012  8:02:51.397000 AM", "mdY IMOSp")
[1] "2012-08-01 08:02:51.39 UTC"

kenahoo · 2020-07-28T19:13:17Z

Thanks Vitalie, will do. I wasn't sure how that would handle a T separator between the date & time, but it looks like that works fine.

kenahoo · 2020-10-01T21:54:45Z

Hi Vitalie, any sense of when this might be released? I'm using a workaround in my code right now that does x <- gsub(" +", " ", x), but I'd rather put a condition on it like:

  if (packageVersion('lubridate') < '1.7.10')
    x <- gsub("  +", " ", x)  # See https://github.com/tidyverse/lubridate/issues/911

I decided against using parse_date_time2() because sometimes I want to explicitly require the T separator, and I can't see how to do that without using exact=TRUE, which takes away the flexible-whitespace behavior.
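The conditional workaround above could be wrapped in a small helper. A minimal sketch, with two caveats: squeeze_spaces_if_needed() is a hypothetical name, and the 1.7.10 threshold is only the guess made in this comment, not a confirmed release number, so it is exposed as a parameter.

```r
# Hypothetical helper: collapse runs of spaces only when the installed
# lubridate predates the release assumed to contain the fix.
squeeze_spaces_if_needed <- function(x,
                                     installed = utils::packageVersion("lubridate"),
                                     fixed_in = "1.7.10") {  # assumed version, adjust as needed
  if (installed < fixed_in) {
    x <- gsub(" {2,}", " ", x)  # two or more spaces become one
  }
  x
}

# Passing the version explicitly exercises both branches
# without needing lubridate to be installed.
squeeze_spaces_if_needed("8/1/2012  8:02:51.397000 AM", installed = package_version("1.7.9"))
#> [1] "8/1/2012 8:02:51.397000 AM"
squeeze_spaces_if_needed("8/1/2012  8:02:51.397000 AM", installed = package_version("1.9.0"))
#> [1] "8/1/2012  8:02:51.397000 AM"
```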

kenahoo · 2022-01-03T19:14:14Z

Hi @vspinu - any update on this?

vspinu · 2022-01-11T21:57:48Z

@kenahoo sorry for ditching this. Spaces are ignored in fast_strptime now. Thanks for bumping it up!

kenahoo · 2022-01-13T17:42:34Z

Hi @vspinu, no problem! I just tried the latest version from GitHub and it does indeed fix the problems in my tests. Thanks for the fix.

kenahoo · 2022-01-13T17:52:16Z

I noticed that now whitespace is allowed between any tokens whatsoever, so the following happens - is that intentional?

lubridate::fast_strptime('2019 -    10-24   T    00:00   +1300', "%Y  -  %m  -  %dT%H:%M%z")
#> [1] "2019-10-23 11:00:00 UTC"
lubridate::fast_strptime('2019-10-24T00:00+1300', "%Y  -  %m  -  %dT%H:%M%z")
#> [1] "2019-10-23 11:00:00 UTC"
lubridate::fast_strptime('2019 -    10-24   T    00:00   +1300', "%Y-%m-%dT%H:%M%z")
#> [1] "2019-10-23 11:00:00 UTC"

strptime('2019 -    10-24   T    00:00   +1300', "%Y  -  %m  -  %dT%H:%M%z")
#> [1] NA
strptime('2019-10-24T00:00+1300', "%Y  -  %m  -  %dT%H:%M%z")
#> [1] "2019-10-23 11:00:00 UTC"
strptime('2019 -    10-24   T    00:00   +1300', "%Y-%m-%dT%H:%M%z")
#> [1] NA

Created on 2022-01-13 by the reprex package (v2.0.1)

vspinu · 2022-01-13T18:04:23Z

yes, that's intentional. Why would middle space be special?

kenahoo · 2022-01-13T18:56:04Z

I'm just pointing out that whereas my original case was about varying nonzero amounts of whitespace, i.e. treating 1 or 2 space characters the same, the new code also treats 0 and 1 space characters the same. That seems like a bigger change and possibly contrary to many people's expectations.

As shown above, it also doesn't sync up very well with strptime's behavior.

vspinu · 2022-01-13T20:03:31Z

I see what you mean. A space in the format should match one or more spaces in the input; no space in the format should not match any. This makes sense.

vspinu · 2022-01-13T21:37:59Z

Fixed.

kenahoo · 2022-01-14T23:14:06Z

Cool - looks more like what I expected:

lubridate::fast_strptime('2019 -    10-24   T    00:00   +1300', "%Y  -  %m  -  %dT%H:%M%z")
#> [1] NA
lubridate::fast_strptime('2019-10-24T00:00+1300', "%Y  -  %m  -  %dT%H:%M%z")
#> [1] "2019-10-23 11:00:00 UTC"
lubridate::fast_strptime('2019 -    10-24   T    00:00   +1300', "%Y-%m-%dT%H:%M%z")
#> [1] NA

strptime('2019 -    10-24   T    00:00   +1300', "%Y  -  %m  -  %dT%H:%M%z")
#> [1] NA
strptime('2019-10-24T00:00+1300', "%Y  -  %m  -  %dT%H:%M%z")
#> [1] "2019-10-23 11:00:00 UTC"
strptime('2019 -    10-24   T    00:00   +1300', "%Y-%m-%dT%H:%M%z")
#> [1] NA

Created on 2022-01-14 by the reprex package (v2.0.1)
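The three results above are consistent with a rule where a run of spaces in the format matches zero or more spaces in the input, and everything else has to match literally. A toy sketch of that rule in base R follows. It is not lubridate's C implementation; format_to_regex() and matches_format() are hypothetical names, and only the handful of format codes used in this reprex are covered.

```r
# Hypothetical illustration: translate a small subset of strptime-style
# format codes into a regular expression.
format_to_regex <- function(fmt) {
  rx <- gsub(" +", " *", fmt)                    # spaces in the format: zero or more spaces in the input
  rx <- gsub("%Y", "[0-9]{4}", rx, fixed = TRUE) # four-digit year
  rx <- gsub("%[mdHM]", "[0-9]{1,2}", rx)        # month, day, hour, minute
  rx <- gsub("%z", "[+-][0-9]{2}:?[0-9]{2}", rx, fixed = TRUE)  # offset, colon optional
  paste0("^", rx, "$")
}

matches_format <- function(x, fmt) grepl(format_to_regex(fmt), x)

spaced  <- "2019 -    10-24   T    00:00   +1300"
compact <- "2019-10-24T00:00+1300"

matches_format(spaced,  "%Y  -  %m  -  %dT%H:%M%z")
#> [1] FALSE
matches_format(compact, "%Y  -  %m  -  %dT%H:%M%z")
#> [1] TRUE
matches_format(spaced,  "%Y-%m-%dT%H:%M%z")
#> [1] FALSE
```

The spaced input fails against the first format because it has spaces around the T, which the format does not allow at that position.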

@DavisVaughan

Version 1.9.2
=============

### BUG FIXES

* [#1104](tidyverse/lubridate#1104) Fix incorrect parsing of months when %a format is present.

### OTHER

* Adapt to internal name changes in R-devel

Version 1.9.1
=============

### NEW FEATURES

* `as_datetime()` accepts multiple formats in format argument, just like `as_date()` does.

### BUG FIXES

* [#1091](tidyverse/lubridate#1091) Fix formatting of numeric inputs to parse_date_time.
* [#1092](tidyverse/lubridate#1092) Fix regression in `ymd_hm` on locales where `p` format is not defined.
* [#1097](tidyverse/lubridate#1097) Fix `as_date("character")` to work correctly with formats that include extra characters.
* [#1098](tidyverse/lubridate#1098) Roll over the month boundary in `make_datetime()` when units exceed their maximal values.
* [#1090](tidyverse/lubridate#1090) timechange has been moved from Depends to Imports.

Version 1.9.0
=============

### NEW FEATURES

* `roll` argument to updating and time-zone manipulation functions is deprecated in favor of a new `roll_dst` parameter.
* [#1042](tidyverse/lubridate#1042) `as_date` with character inputs accepts multiple formats in `format` argument. When `format` is supplied, the input string is parsed with `parse_date_time` instead of the old `strptime`.
* [#1055](tidyverse/lubridate#1055) Implement `as.integer` method for Duration, Period and Interval classes.
* [#1061](tidyverse/lubridate#1061) Make `year<-`, `month<-` etc. accessors truly generic. In order to make them work with arbitrary class XYZ, it's enough to define a `reclass_date.XYZ` method.
* [#1061](tidyverse/lubridate#1061) Add support for `year<-`, `month<-` etc. accessors for `data.table`'s IDate and ITime objects.
* [#1017](tidyverse/lubridate#1017) `week_start` argument in all lubridate functions now accepts full and abbreviated names of the days of the week.
* The assignment value `wday<-` can be a string either in English or as provided by the current locale.
* Date rounding functions accept a date-time `unit` argument for rounding to a vector of date-times.
* [#1005](tidyverse/lubridate#1005) `as.duration` now allows for full roundtrip `duration -> as.character -> as.duration`
* [#911](tidyverse/lubridate#911) C parsers treat multiple spaces as one (just like strptime does)
* `stamp` gained new argument `exact=FALSE` to indicate whether `orders` argument is an exact strptime format string or not.
* [#1001](tidyverse/lubridate#1001) Add `%within%` method with signature (Interval, list), which was documented but not implemented.
* [#941](tidyverse/lubridate#941) `format_ISO8601()` gained a new option `usetz="Z"` to format time zones with a "Z" and convert the time to the UTC time zone.
* [#931](tidyverse/lubridate#931) Usage of `Period` objects in rounding functions is explicitly documented.

### BUG FIXES

* [#1036](tidyverse/lubridate#1036) `%within%` now correctly works with flipped intervals
* [#1085](tidyverse/lubridate#1085) `as_datetime()` now preserves the time zone of the POSIXt input.
* [#1072](tidyverse/lubridate#1072) Names are now handled correctly when combining multiple Period or Interval objects.
* [#1003](tidyverse/lubridate#1003) Correctly handle r and R formats in locales which have no p format
* [#1074](tidyverse/lubridate#1074) Fix concatenation of named Period, Interval and Duration vectors.
* [#1044](tidyverse/lubridate#1044) POSIXlt results returned by `fast_strptime()` and `parse_date_time2()` now have a recycled `isdst` field.
* [#1069](tidyverse/lubridate#1069) Internal code handling the addition of period months and years no longer generates partially recycled POSIXlt objects.
* Fix rounding of POSIXlt objects
* [#1007](tidyverse/lubridate#1007) Internal lubridate formats are no longer propagated to stamp formatter.
* `train` argument in `parse_date_time` now takes effect. It was previously ignored.
* [#1004](tidyverse/lubridate#1004) Fix `c.POSIXct` and `c.Date` on empty single POSIXct and Date vectors.
* [#1013](tidyverse/lubridate#1013) Fix c(`POSIXct`,`POSIXlt`) heterogeneous concatenation.
* [#1002](tidyverse/lubridate#1002) Parsing only with format `j` now works on numeric inputs.
* `stamp()` now correctly errors when no formats could be guessed.
* Updating a date with timezone (e.g. `tzs = "UTC"`) now returns a POSIXct.

### INTERNALS

* `lubridate` is now relying on `timechange` package for update and time-zone computation. Google's CCTZ code is no longer part of the package.
* `lubridate`'s updating logic is now built on top of `timechange` package.
* Change implementation of `c.Period`, `c.Duration` and `c.Interval` from S4 to S3.

Version 1.8.0
=============

### NEW FEATURES

* [#960](tidyverse/lubridate#960) `c.POSIXct` and `c.Date` can deal with heterogeneous object types (e.g `c(date, datetime)` works as expected)

### BUG FIXES

* [#994](tidyverse/lubridate#994) Subtracting two duration or two period objects no longer results in an ambiguous dispatch note.
* `c.Date` and `c.POSIXct` correctly deal with empty vectors.
* `as_datetime(date, tz=XYZ)` returns the date-time object with HMS set to 00:00:00 in the corresponding `tz`

### CHANGES

* [#966](tidyverse/lubridate#966) Lubridate is now built with cpp11 (contribution of @DavisVaughan)

vspinu closed this as completed in 5b9eec8 Jan 11, 2022

vspinu added a commit that referenced this issue Jan 13, 2022

[#911] fast_strptime skip multiple spaces only if space is present in format (53e5892)
