bind -> rbind.fill; getMetadata; GeoJSON method #56

Merged
merged 2 commits into from
Oct 26, 2015

Conversation

@dmpe dmpe commented Sep 7, 2015

Functions to speed up the row binding

So, to begin with, there are 4 (to me) known functions which can be used to bind rows together. These are:

  • rbind from base
    This is known to be slow with large datasets. For example, see this: http://rpubs.com/wush978/6302
  • rbind.fill from plyr
    A faster version of rbind, implemented in R, and the only other function which can be used for our purpose.
  • bind_rows from dplyr: This is a faster (C-implemented) version of rbind.fill from plyr. The only significant change for us is that it is far stricter about the types permitted in a column. While plyr "doesn't seem to care much" and will "just do it", dplyr will check and throw an error if there is, e.g., an integer and a character in the same column. There is already a bug filed, which has been closed (Feature request - rbind_all type coercion: 'Just do it' option? tidyverse/dplyr#1162), with hadley saying
    I'm currently happy with the behaviour of bind_rows(), and if you'd prefer the rbindlist() behaviour, you're free to use that.
    Thus, as we cannot really rely on it (and from my own experience working with dplyr), it is not a good option for this package.
  • rbindlist from data.table:
    I'm hesitant of data.table. I'm not a huge fan of their slightly off-kilter data frame. data.table is fast, but I worry they just aren't going to get a critical mass of users to make it well-known to many R programmers.
    Yes, I fully agree with this, and it is also the reason why I have decided not to use it here. If we decided to do it, we would make the package in part backwards incompatible (and would be required to run a marketing campaign for our users :) )
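The strictness difference described above can be sketched with base R alone (a minimal, dependency-free illustration; the plyr and dplyr behaviour is only described in comments, as reported in this thread):

```r
# Two one-column data frames whose `x` columns have different types.
a <- data.frame(x = 1L)                                 # integer
b <- data.frame(x = "quake", stringsAsFactors = FALSE)  # character

# Base rbind "just does it": the integer is silently coerced to character.
combined <- rbind(a, b)
print(class(combined$x))  # "character"

# plyr::rbind.fill(a, b) likewise "just does it" (and additionally fills
# missing columns with NA), whereas dplyr's bind_rows() refuses to combine
# an integer column with a character column and throws an error instead.
```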

Tests

I have committed the benchmarking file (Rmd) here without the appropriate functions because otherwise it would be a huge mess. Here is a gist for them: https://gist.github.com/dmpe/5aec87f0c7a5ae2115ca

  • Benchmark rbind @ google
    • A very good one from 2014: http://www.exegetic.biz/blog/2014/06/concatenating-a-list-of-data-frames/
    • With our sample data: http://rpubs.com/F789GH/BenchmarkRowsBinding
      • While with the first one I cannot apply dplyr because it prints an error, with the second one there is no such problem. This just confirms the issue with dplyr I have described above. Again, with the third one below, dplyr will fail again.
      • Looking at mean in seconds: 1. datatable; 2. plyr; 3. base; 4. dplyr (? what ?)
      • Looking at median in seconds: 1. datatable; 2. dplyr; 3. plyr; 4. base (This already makes sense)
      • Summary: Given that dplyr is complicated and data.table is a no-go, the only option here is either to use plyr or to stick with base rbind. Due to the speed-up (small on small datasets, bigger on larger ones), I decided to choose plyr. But this can - of course - be discussed further. BTW, I also think (and have actually seen) that it is largely dataset dependent, i.e. with some smaller datasets rbind would be faster, while with others plyr would be better.

Third one:

big <- microbenchmark(
  read.socrataRBIND("https://data.ny.gov/resource/hrvs-fxs2.csv"),
  read.socrataPLYR("https://data.ny.gov/resource/hrvs-fxs2.csv"),
  read.socrataDPLYR("https://data.ny.gov/resource/hrvs-fxs2.csv"),
  read.socrataDATATABLE("https://data.ny.gov/resource/hrvs-fxs2.csv"),
  times = 3L,
  unit = "s"
)
big
Unit: seconds
                                                                expr      min       lq     mean   median       uq      max neval cld
     read.socrataRBIND("https://data.ny.gov/resource/hrvs-fxs2.csv") 17.24245 17.43700 17.80954 17.63156 18.09309 18.55463     3   a
      read.socrataPLYR("https://data.ny.gov/resource/hrvs-fxs2.csv") 17.39023 17.41183 17.43587 17.43343 17.45869 17.48395     3   a
 read.socrataDATATABLE("https://data.ny.gov/resource/hrvs-fxs2.csv") 16.90609 16.93185 17.24587 16.95761 17.41576 17.87392     3   a

This will fail at dplyr again; without it: 1. datatable; 2. plyr; 3. base

Detecting file output (geojson, csv, json)

Excellent! I wonder, though, if this should be a different method, we detect it using the extension, or use an option within read.socrata. For instance, here are alternative scenarios:

Yes, I thought about that when creating this patch set.

The reason I decided to split off the GEO method was that the read.socrata method would otherwise be really huge.
Given your geojson choice, the geojson_read method from the geojsonio package offers several arguments such as method, parse, and what, and these must be accessible to the user. Thus you would do something like

read.socrata <- function(url = NULL, app_token = NULL, limit = 50000, domain = NULL, fourByFour = NULL, query = NULL, offset = 0, output = c("csv", "json", "geojson"), ...) {}
where ... would be used for geojson_read arguments.
OR
read.socrata <- function(url = NULL, app_token = NULL, limit = 50000, domain = NULL, fourByFour = NULL, query = NULL, offset = 0, output = c("csv", "json", "geojson"), method = "something", parse = "something", what="something") {}
However, the last 3 arguments apply only to the "geojson" choice. One would document that, but just looking at such a function I would be confused.

The current read.socrataGEO method is small and very clear to the user, as it deals only with the GEO stuff, while the other one is used exclusively for CSV & JSON.

But, look, I agree that this is largely a personal preference and nothing stops us from doing what Tom has suggested. This is also the reason why I wanted some feedback on my changes.
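The first variant above, forwarding the geojson-specific options through `...`, can be sketched as follows (a minimal illustration; `fetch_geojson` and `read.socrata_sketch` are hypothetical stand-ins, not functions in RSocrata or geojsonio):

```r
# Hypothetical stand-in for geojsonio::geojson_read(); the real function
# accepts method, parse and what among its arguments.
fetch_geojson <- function(url, method = "web", parse = FALSE, what = "list") {
  list(url = url, method = method, parse = parse, what = what)
}

# Sketch of the proposed signature: geojson-specific options travel
# through ... and never clutter the main argument list.
read.socrata_sketch <- function(url, output = c("csv", "json", "geojson"), ...) {
  output <- match.arg(output)
  if (output == "geojson") {
    fetch_geojson(url, ...)  # forward method/parse/what untouched
  } else {
    url                      # CSV/JSON handling would go here
  }
}

res <- read.socrata_sketch("https://data.example.com/x.geojson",
                           output = "geojson", parse = TRUE)
print(res$parse)  # TRUE
```

The trade-off is exactly the one described above: the signature stays short, but the forwarded arguments are invisible unless documented.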

Changed behaviour

  • I have put the application token into the header of the httr::GET() request in the errorHandling() method, so we can spare ourselves dealing with $$token= elsewhere in the code. The tests have been edited for this change.
  • I have switched to JSON only (kind of) and thus added new validation logic.

@dmpe dmpe changed the title geojson-big New set of commits [see email] Sep 7, 2015
@@ -19,6 +19,7 @@ r_github_packages:
- jeroenooms/curl
- klutometis/roxygen
- jimhester/covr
- yihui/mime
- ropensci/geojsonio

We now also require mime version >= 0.4, which has been published to CRAN
https://cran.rstudio.com/web/packages/mime/index.html

@dmpe dmpe changed the title New set of commits [see email] New set of commits [see email] [rbind -> rbind.fill; getMetadata; GeoJSON meth. ] Sep 8, 2015
@dmpe dmpe changed the title New set of commits [see email] [rbind -> rbind.fill; getMetadata; GeoJSON meth. ] bind -> rbind.fill; getMetadata; GeoJSON method Sep 8, 2015
@@ -1,34 +1,46 @@
# An interface to data hosted online in Socrata data repositories
# This is the main file which uses other functions to download data from a Socrata repositories
#
# Author: Hugh J. Devlin, Ph. D. 2013-08-28
# Author: Hugh J. Devlin, Ph.D et al.

to be fixed later
DONE

@dmpe commented Sep 8, 2015

With this PR, the following issues can be marked as closed:
#43
#19
#27
#33

@dmpe commented Sep 11, 2015

Hi @tomschenkjr & @geneorama,
could you please take a look at these preliminary changes? Thank you.

}

# http://stackoverflow.com/a/7964098
# For TRANSITION: it will be easier for users of CSV to translate to JSON by warning them.

TODO: transit

@dmpe dmpe force-pushed the dev-geojson-big branch 3 times, most recently from ca89a33 to 0982a6b Compare September 30, 2015 10:11
@tomschenkjr

@dmpe - Quite a few notes to get through.

At a high level, I think it's important to remember two key goals of the package, both of which basically sum up to making it "easy" to work with Socrata URLs:

  1. Allow users to copy/paste either a valid SoDA URI or "human-readable" URL without having to handle the web call itself (e.g., paging) to put it into a convenient storage type
  2. Have the package translate and handle data types behind-the-scenes to be convenient in R (e.g., dates)

There are three primary interactions with RSocrata (plus a proposed fourth):

A. Use a valid SoDA API asking for a JSON file: e.g., https://data.example.com/resource/four-four.json
B. Use a valid SoDA API asking for a CSV file: e.g., https://data.example.com/resource/four-four.csv
C. Use a human readable API e.g., https://data.example.com/Human-Readable-Name/four-four
D. Proposed: Use a human readable GeoJSON output.

Finally, in the spirit of semantic versioning, we should be mindful of introducing anything that is not backward compatible for the end-user. Everything "behind-the-scenes" can be reworked more frequently to find the better solution. Breaking backward compatibility should be delayed if not entirely avoided.

The above items should be our central tenets. As long as these are met, I'm good with any modification.

So, keeping that in mind with the comment above, here are my thoughts:

  • I don't mind which method is used for binding rows, as long as it works. data.table seems to violate (1) given its odd data storage. Otherwise, whatever is fastest works.
  • For file output, we should avoid overthinking this. Basically, it should meet criteria (A), (B), and (C). Behind-the-scenes doesn't matter. So, I think it's overcomplicating by asking the user. We should just make the three primary interactions work. So, we should remove user-options on this and optimize the end-user experience by making our own choice.
  • For read.socrataGeo, your explanation makes sense. However, let me ask a question-- does the user really need to access parse, method and what, or is this something we can choose? I ask because, since this is Socrata, all of those options may be the same regardless of the specific dataset.
  • I think the switch to JSON only is not bad because it's a "backend" work. However, I think the biggest issue is the lack of backwards compatibility with human-readable URLs. This breaks the spirit of semantic versioning.

@dmpe commented Oct 6, 2015

Hi Tom,
Thanks for the reply. See my answers below.

I don't mind which method is used for binding rows, as long as it works. data.table seems to violate (1) given it's odd data storage. Otherwise, whatever is fastest works.

Plyr

For file output, we should avoid overthinking this. Basically, it should meet criteria (A), (B), and (C). Behind-the-scenes doesn't matter. So, I think it's overcomplicating by asking the user. We should just make the three primary interactions work. So, we should remove user-options on this and optimize the end-user experience by making our own choice.

I think the switch to JSON only is not bad because it's a "backend" work.

Yep, no option for the user. Everything will be done in JSON.

For read.socrataGeo, your explanation makes sense. However, let me ask a question-- does the user really need to access parse, method and what, or is this something we can choose? I ask because, since this is Socrata, all of those options may be the same regardless of the specific dataset.

We can certainly have this:
results <- geojsonio::geojson_read(url, method = "local", what = "list", parse = FALSE)
Then the method would have only two arguments, that is, the URL and ....

However, I think the biggest issue is the lack of backwards compatibility with human-readable URLs. This breaks the spirit of semantic versioning.

Fail. I wrote something where the text and the code say two different things. In fact, the current changes are backwards compatible! What the code does is take a http://soda.demo.socrata.com/dataset/USGS-Earthquake-Reports/4334-bgaj.csv and replace it with a SODA one.

Sorry for the confusion! I deleted that previous message above.

In case you are going to download data using a human-readable URL with a csv suffix, you will get this:

BEWARE: Your suffix is no longer supported. Thus, we will automatically replace it with JSON.
BEWARE: RSocrata no longer supports human readable URLs, thus we will convert it to the SODA 
            ones (with resource/xxxx-xxxx.json) instead.

@dmpe dmpe force-pushed the dev-geojson-big branch from b385086 to 3bb2c20 Compare October 6, 2015 22:40
@dmpe commented Oct 20, 2015

Hi @tomschenkjr, any update on this? I believe I have addressed your points. But if not, please let me know.
Thanks.

@tomschenkjr

Looks good. Pulling into the dev branch.

tomschenkjr pushed a commit that referenced this pull request Oct 26, 2015
bind -> rbind.fill; getMetadata; GeoJSON method
@tomschenkjr tomschenkjr merged commit b8df56d into Chicago:dev Oct 26, 2015
@dmpe commented Oct 26, 2015

Thank you!

@tomschenkjr

FYI, I have a small project in which I'm running the dev version and hitting some issues that I'll be posting on. In the meantime, here is the code I'm running:

ls_chicago <- ls.socrata("https://data.cityofchicago.org")
four_by_four <- substr(ls_chicago$identifier, 42, 50)

read.socrata(url = "http://data.cityofchicago.org", 
             fourByFour = four_by_four[1])

Resulting in:

Error in validateUrl(url) : 
   is not a valid Socrata dataset unique identifier.

Also, the unit testing seemed to fail, but I can resolve those shortly.

@dmpe commented Oct 26, 2015

Thanks, I saw that too.
I will take a look at the first thing and you can take the second one.

@dmpe commented Oct 26, 2015

Ok, for one, you need to replace

read.socrata(url = "http://data.cityofchicago.org", 
             fourByFour = four_by_four[1])

with

 read.socrata(domain = "http://data.cityofchicago.org", 
               fourByFour = four_by_four[1])

Did it always work? Because I don't know that we can even do so. URL parameters have always - I believe - consisted of the whole, complete URL.

@tomschenkjr

Got it, that's correct. Sorry, my transcription error.

The error I had meant to post was

Error in curl::curl_fetch_memory(url, handle = handle) : 
  Couldn't resolve host name

@dmpe commented Oct 26, 2015

I got this error too, BUT at random; it is not reproducible. I have no idea how to deal with it. Plus, what does that even mean, given that asd <- httr::GET("https://data.cityofchicago.org/resource/kc9i-wq85.json", httr::add_headers("X-App-Token" = NULL)) is, for me at least, a non-issue?

@dmpe commented Oct 26, 2015

Plus it happens on a case-by-case basis. Some work without the error, while others do not.

I will soon provide you a link (not a PR) which will make some things better (but not this):
Tom, please pull this here dmpe@95de70b
(upon request I can make a PR)

Maybe something like https://stackoverflow.com/questions/19641808/r-error-in-connection-to-host
https://stackoverflow.com/questions/17783686/solution-how-to-install-github-when-there-is-a-proxy ?
Sorry, cannot test it now.

@tomschenkjr

Ok, this seems to be the offending line.

I'm submitting a couple of corrections to dev after I check out a few more things and write a couple of tests to try to catch these errors in the future.

I'm also going to remove the warnings being thrown on human-readable URL support. Since it's not being deprecated, and it's a backend adjustment that doesn't directly impact the code, we just need to note it in the release notes.

@tomschenkjr

Pushed changes to the repo. It took a while to get this to build into a package, so we need to keep that in order.

Having trouble building the examples, so I need to insert more \dontrun{} around the examples. I opened #60

I also had to remove the vignettes to build the package. Opened #59 to note their addition again.

The new geojson feature will be great. It will need some tweaks but will be able to get into 1.7.

@dmpe commented Oct 27, 2015

@tomschenkjr - regarding the problem with building the URL: please also take a look at https://cran.r-project.org/web/packages/urltools/index.html, as I am now more and more occupied with university work and have less and less time to boot a VM and work on RSocrata.
If we could replace the 2 httr methods with it, I would be highly in favor.

Regarding "I also had to remove the vignettes to build the package. Opened #59 to note their addition again.":
This will need to be fixed in the following release. We must have them on CRAN. In any case, I will help.

@natecobb natecobb mentioned this pull request Oct 27, 2015
@dmpe commented Oct 28, 2015

PS: https://github.com/Chicago/RSocrata/blob/dev/R/returnData.R#L102
It should now also be documented that domain = "http://data.cityofchicago.org" works, while domain = "http://data.cityofchicago.org/" (with a / at the end) will not work properly.
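One defensive fix would be to normalize the domain before the URL is built, so a trailing slash no longer breaks it (a sketch only; normalize_domain is a hypothetical helper, not a function in RSocrata):

```r
# Strip any trailing slashes from a user-supplied domain before it is
# pasted together with "/resource/xxxx-xxxx.json".
normalize_domain <- function(domain) {
  sub("/+$", "", domain)
}

print(normalize_domain("http://data.cityofchicago.org/"))  # trailing / removed
print(normalize_domain("http://data.cityofchicago.org"))   # unchanged
```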

dmpe added a commit to dmpe/RSocrata that referenced this pull request Dec 24, 2015
Merge pull request Chicago#51 from Chicago/dev

Pull request for v1.6.1
Remove NEWS.md from build to avoid NOTE

Build number bump and date change

R 3.2.1 wants all URLs in canonical form--has been changed

Fix type-o in Coveralls badge
Merge remote-tracking branch 'upstream/dev' into dev

split & improve 4x4 logic (tests pass)
bump to 1.6.3
Add new Floatin Timestamp format for posixify + add test

split validate + test

validate url split test

fix travis

don't run, not do not run
fix travis again

add errorHandling function
and rename getResponse to checkResponse
update test

else if [skip ci]

add not finished test for Chicago#27, Chicago#24
[skip ci]

go over sprint7 branch [skip ci]

text/plain [skip ci]

fix paste -> paste0 because of sep="" which I deleted

delete note.md + update date [skip ci]

geo readme

Merge remote-tracking branch 'upstream/dev' into dev

Merge remote-tracking branch 'origin/dev' into dev

Conflicts:
	DESCRIPTION

add geojson example & vignette (dependencies are suggested, later to be moved to required)

should fix travis

now fix again

second time the same.

give up

give up on travis. wont work

?

fix last time

again

tests and comments (geo too)
update functions, move 4x4 to utils.R
update docu + .md files

add geojson support. not finished

Merge pull request Chicago#53 from dmpe/dev

Split functions and other smaller improvements
fix leaflet example, not SP object but the list and add a new contributor
fix test [skip ci]

Merge pull request Chicago#56 from dmpe/dev-geojson-big

bind -> rbind.fill; getMetadata; GeoJSON method
Fixed version number scheme

Using building numbers, x.y.z-b. The "z" should only be incremented on bug releases being planned.
Human-readable URLs are not being deprecated

Fixes build url under certain domain cases

Re-added human-readable URL

Added unit tests for broken-out URLs

Removed unnecessary library listings

Cleaned-up documentation

Increment build info

Updated help files

Removed vignettes, updated documentation for style

Example formatting issues, cleaned-up comments

Turned on tests

Examples on read.socrata taking too long, stopped their run

Merge pull request #4 from Chicago/dev

Dev