NA's from date fields when reading JSON's #121

Closed
jjhall77 opened this issue Feb 21, 2017 · 14 comments · Fixed by #123

@jjhall77

When I use read.socrata on a link to a JSON endpoint, all of the date fields are returned as NAs.

url <- "https://data.cityofnewyork.us/resource/xx67-kt59.json" df <- read.socrata(url)

The dates are populated in the actual JSON file.

@tomschenkjr self-assigned this Feb 21, 2017
@tomschenkjr
Contributor

I've replicated the error with the data set above using the most recent version. However, I was not able to reproduce the problem on a similar data set with timestamps. The issue may be related to #68, but I will investigate a bit further to determine the cause.

@jjhall77 - In the meantime, the following commands will work because they do not request the JSON format; the dates appear properly formatted in R.

url <- "https://data.cityofnewyork.us/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/xx67-kt59"
df <- read.socrata(url)

@tomschenkjr
Contributor

The underlying problem is that the date format for https://data.cityofnewyork.us/resource/xx67-kt59.json is YYYY-MM-DDTHH:MM:SS. This format is not being converted to POSIXct appropriately, as shown by:

> posixify("2017-02-21T00:00:00")
[1] NA

However, the date format in my other example, https://data.cityofchicago.org/resource/cwig-ma7x.json, is YYYY-MM-DDTHH:MM:SS.sss (with subseconds), and it is successfully converted, for example:

> posixify("2016-02-18T00:00:00.000")
[1] "2016-02-18 CST"

However, this type of format is not used in the CSV files, so those convert successfully. When the file format is not specified in the request, the download defaults to CSV.

We will patch this bug and release it on GitHub, where it can be installed using the devtools package. The next release is slated for March.
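
As an aside (this is just an illustration, not how posixify is currently written): base R's %OS input specification accepts seconds with or without a fractional part, so a single format string will parse both of the timestamps above.

strptime("2017-02-21T00:00:00", format = "%Y-%m-%dT%H:%M:%OS")
# [1] "2017-02-21 CST"   (printed timezone reflects the local session)
strptime("2016-02-18T00:00:00.000", format = "%Y-%m-%dT%H:%M:%OS")
# [1] "2016-02-18 CST"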

@tomschenkjr
Contributor

After looking at the code and realizing we've previously had similar errors with #68 and #106, I'm going to try to improve the error handling for this with the following steps:

  • Add the ability to handle date/time formats of the form YYYY-MM-DDTHH:MM:SS (the source of this issue).
  • We will be checking for 4 different date/time formats after this modification. I'll add a specific warning if a date/time format is not properly converted, in case there is a future format that we don't handle (e.g., YYYY-MM-DDTHH:MM:SS.sssssss, or other country formats).
  • If the date/time format is not properly converted, have RSocrata return the data as chr strings instead of NA. This will let the user convert manually if necessary and will reduce the severity of errors like these in the future (see the sketch below).
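
A rough sketch of that fallback behavior (the function name and the exact format list here are illustrative, not the actual RSocrata implementation):

parseSocrataDate <- function(x) {
  # Candidate formats, tried in order; %OS accepts seconds with or without a fractional part
  formats <- c("%Y-%m-%dT%H:%M:%OS",   # ISO 8601, the source of this issue
               "%Y-%m-%d %H:%M:%S",    # timestamp as it appears in CSV downloads
               "%m/%d/%Y %I:%M:%S %p", # US-style timestamp with AM/PM
               "%m/%d/%Y")             # US-style date only
  for (fmt in formats) {
    parsed <- as.POSIXct(x, format = fmt)
    if (!anyNA(parsed)) return(parsed)
  }
  # No format matched every element: warn and return the raw characters instead of NA
  warning("Could not convert date/time field; returning character strings.")
  x
}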

@geneorama
Member

In my experience it's tricky to get R to print out subseconds. You might find this code useful:
options(digits.secs = 3)

This allows you to see three decimal places which are otherwise hidden. For example:

> Sys.time()
[1] "2017-02-23 11:16:32 CST"
> options(digits.secs = 3)
> Sys.time()
[1] "2017-02-23 11:16:38.409 CST"
> 

@tomschenkjr
Contributor

@geneorama - displaying the decimal seconds was not the issue in this case. It was anything in ISO 8601 without decimal seconds that was causing the error.

@geneorama
Member

geneorama commented Feb 24, 2017 via email

@geneorama
Member

@tomschenkjr Commenting as of b0ce826

For the most part the function is doing what I would expect, but I noticed something unexpected.

The fractional part of the second doesn't seem to be captured, and it doesn't round the result. For example:

> posixify("2012-09-14T22:14:21.000")
[1] "2012-09-14 22:14:21 CDT"
> posixify("2012-09-14T22:14:21.999")
[1] "2012-09-14 22:14:21 CDT"

Is this expected behavior?

@tomschenkjr
Contributor

tomschenkjr commented Mar 3, 2017 via email

@geneorama
Member

The options didn't matter; I set digits.secs to 3 and it still didn't show the fractional part of the second. My point in the example above is that even .999 gets dropped and shown as 0.

options(digits.secs = 3)
posixify("2012-09-14T22:14:21.999")
# [1] "2012-09-14 22:14:21 CDT"

unclass(posixify("2012-09-14T22:14:21.999"))
# [1] 1347678861
# attr(,"tzone")
# [1] ""

Sys.time()
# [1] "2017-03-03 17:03:58.753 CST"

@geneorama
Member

The example above shows that the fractional time isn't coming through, even though it could be there.

@tomschenkjr
Contributor

tomschenkjr commented Mar 3, 2017

To clarify, R does not appear to round decimal seconds; it just truncates them, which is similar to what RSocrata does, so I think that would be expected behavior.

z <- Sys.time()
z
# [1] "2017-03-03 17:14:45 CST"
options(digits.secs = 3)
z
# [1] "2017-03-03 17:14:45.517 CST"

@geneorama
Member

When printing the time from Sys.time(), you get fractional seconds if you have digits.secs set in R's options.
When using posixify, fractional seconds are dropped.

options(digits.secs = 3)
posixify("2012-09-14T22:14:21.999")
# [1] "2012-09-14 22:14:21 CDT"

According to the help for strptime, the format argument is:

A character string. The default for the format methods is "%Y-%m-%d %H:%M:%S" if any element has a time component which is not midnight, and "%Y-%m-%d" otherwise. If options("digits.secs") is set, up to the specified number of digits will be printed for seconds.

The help implies that if you have the option set, the format function will use the decimal portion of the given time; however, this doesn't seem to be the case:

> options(digits.secs = 3)
> as.POSIXct("2012-09-14T22:14:21.999", format = "%Y-%m-%dT%H:%M:%S")
[1] "2012-09-14 22:14:21 CDT"

Also, the help implies that you shouldn't need to specify the format, because the default is "%Y-%m-%d %H:%M:%S". However, with the "T" in the string, that doesn't seem to work as expected:

> as.POSIXct("2012-09-14T22:14:21.999")
[1] "2012-09-14 CDT"

It bothers me that there would be a loss of data for fractional seconds, but there doesn't seem to be a good way around it so I'm accepting the pull request.
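
For what it's worth, the culprit appears to be %S versus %OS: with %OS in an explicit format string, the fractional part seems to be retained (assuming digits.secs is set so that it prints). This is just an observation, not what the pull request does.

options(digits.secs = 3)
as.POSIXct("2012-09-14T22:14:21.999", format = "%Y-%m-%dT%H:%M:%OS")
# [1] "2012-09-14 22:14:21.999 CDT"   (printed timezone depends on the session)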

@PriyaDoIT modified the milestone: 1.7.2 Mar 7, 2017
@geneorama
Member

geneorama commented Mar 7, 2017

As far as I can tell, the cleanest way to fix posixify so it keeps the fractional seconds is to replace the T in the middle of the string with a space, then convert the string without specifying a format.

## Conversion without setting digits.secs (NULL is the default)
> options(digits.secs = NULL) 
> as.POSIXct("2012-09-14 22:14:21.999")
[1] "2012-09-14 22:14:21 CDT"  # Fractional second is lost, not hidden (you can check with unclass)

## Conversion WITH digits.secs set
> options(digits.secs = 3)
> as.POSIXct("2012-09-14 22:14:21.999")
[1] "2012-09-14 22:14:21.999 CDT"
## Note that even with digits.secs set, using a "format" argument will lop off the fractional second.
> as.POSIXct("2012-09-14 22:14:21.999", format = "%Y-%m-%d %H:%M:%S")
[1] "2012-09-14 22:14:21 CDT"

What I would do if we have fractional seconds present in the data (see the sketch below):

  1. Check the user's settings to see if they have set digits.secs
  2. If yes, then do the import (after using gsub or sub to replace T with a space)
  3. If it's NULL or not set to 3, then warn the user that they are going to lose sub-second accuracy and suggest setting options(digits.secs = 3)

As discussed I'm going to accept the current pull request as is so that we can move forward with other issues.
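
Here is a minimal sketch of that approach (the function name is made up for illustration; this is not the code in the pull request):

posixifyWithSubseconds <- function(x) {
  # Warn if sub-second precision could be lost or hidden by the user's settings
  if (is.null(getOption("digits.secs"))) {
    warning("Sub-second precision may be lost; consider options(digits.secs = 3).")
  }
  # Swap the ISO 8601 "T" separator for a space, then let as.POSIXct pick the format
  as.POSIXct(sub("T", " ", x, fixed = TRUE))
}

posixifyWithSubseconds("2012-09-14T22:14:21.999")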

@geneorama
Member

Also note, I don't know of a data set that has subseconds in it. I didn't see any in the NY food inspections above.

This made me wonder whether some of the time values might have subsecond values while others don't. If that is possible, then we should add a test for mixed subsecond and non-subsecond JSON vectors, e.g.:
posixify(c("2012-09-14T22:14:21.001", "2012-09-14T22:14:21"))
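
Something along these lines, sketched with testthat (the expectations assume the patched posixify parses both elements; this is not copied from the actual test suite):

library(testthat)
test_that("posixify handles mixed subsecond and plain ISO 8601 timestamps", {
  parsed <- posixify(c("2012-09-14T22:14:21.001", "2012-09-14T22:14:21"))
  expect_false(anyNA(parsed))               # neither element should come back as NA
  expect_true(inherits(parsed, "POSIXct"))  # the result should be converted, not left as chr
})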

geneorama added a commit that referenced this issue Mar 7, 2017
Fixes #121

I added several comments in the original issue #121
nicklucius pushed a commit that referenced this issue Mar 8, 2017
* Added testing for issue #121 -- will fail

* Changed to camel case to accommodate future variable renaming

* Added handling of ISO 8601 timestamps without fractional seconds. See issue #121

* Increased minimum version of R because of `anyNA` function.

* Unformatted dates kept as chr; added warning; added tests

* Added tests for #124 -- will fail

* Handles downloadURL from ls.socrata. Fixes #124
nicklucius added a commit that referenced this issue Apr 12, 2017
* Added unit tests for issue #118

* Add unit test for #120

* Bug fix - fixes #120

* Update version number

* Fixes #124 (#125)

* Added testing for issue #121 -- will fail

* Changed to camel case to accommodate future variable renaming

* Added handling of ISO 8601 timestamps without fractional seconds. See issue #121

* Increased minimum version of R because of `anyNA` function.

* Unformatted dates kept as chr; added warning; added tests

* Added tests for #124 -- will fail

* Handles downloadURL from ls.socrata. Fixes #124

* Closes #129

* Iteration version

* startsWith() requires R >= 3.3.0

* Resolves #133. Also documents per #132

* Updated to reflect consequence of #133

* resolved merge conflict from rebasing

* fixing test - dataset is live so rows can increase, and fixed data types.

* add unit test for issue #118 - will fail

* fixes #118

* Add .gitattributes, ignores it during R build. Closes #135

* Fixes #134

* Clarified language on installing from GitHub

* Removed penultimate release from build testing. Closes #136

* Updated formatting for sections

* Update NEWS.md for 1.7.2 release.

* Included remaining changes to NEWS