-
Notifications
You must be signed in to change notification settings - Fork 82
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
bind -> rbind.fill; getMetadata; GeoJSON method #56
Conversation
@@ -19,6 +19,7 @@ r_github_packages: | |||
- jeroenooms/curl | |||
- klutometis/roxygen | |||
- jimhester/covr | |||
- yihui/mime | |||
- ropensci/geojsonio |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We now also require mime
version >= 4.0 which has been published to CRAN
https://cran.rstudio.com/web/packages/mime/index.html
@@ -1,34 +1,46 @@ | |||
# An interface to data hosted online in Socrata data repositories | |||
# This is the main file which uses other functions to download data from a Socrata repositories | |||
# | |||
# Author: Hugh J. Devlin, Ph. D. 2013-08-28 | |||
# Author: Hugh J. Devlin, Ph.D et al. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to be fixed later
DONE
37d16fe
to
7fe80fb
Compare
Hi @tomschenkjr & @geneorama , |
} | ||
|
||
# http://stackoverflow.com/a/7964098 | ||
# For TRANSITION: it will be easier for users of CSV to translate to JSON by warning them. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO: transit
ca89a33
to
0982a6b
Compare
@dmpe - Quite a few notes to get through. At a high-level, I think it's important to remember two key goals of the package. Both of which basically sum-up to making it "easy" to work with Socrata URLs:
There are three primary interactions with RSocrata: A. Use a valid SoDA API asking for a JSON file: e.g., https://data.example.com/resource/four-four.json Finally, in the spirit of semantic versioning, we should be mindful of introducing anything that is not backward compatible for the end-user. Everything "behind-the-scenes" can be reworked more frequently to find the better solution. Breaking backward compatibility should be delayed if not entirely avoided. The above items should be our central tenants. As long as these are met, then I'm good with any modification. So, keeping that in mind with the comment above, here are my thoughts:
|
Hi Tom,
Plyr
Yep, no option for the user. Everything will be done in JSON.
We can certainly have this:
Fail. I wrote something where text and code are saying two different things. In fact, the current changes are backwards compatible! What it does is that it takes a Sorry for confusion! I deleted that previous message above. In case, you are going to download data using human-readable URL, with a csv suffix you will get this:
|
…utor fix test [skip ci]
Hi, @tomschenkjr Any update on this ? I believe i have addressed your points. But if not, please let me know. |
Looks good. Pulling into the dev branch. |
bind -> rbind.fill; getMetadata; GeoJSON method
Thank you ! |
FYI, I have a small project that I'm running the dev version and hitting some issues that I'll be posting on. In the meantime, here is the code I'm running: ls_chicago <- ls.socrata("https://data.cityofchicago.org")
four_by_four <- substr(ls_chicago$identifier, 42, 50)
read.socrata(url = "http://data.cityofchicago.org",
fourByFour = four_by_four[1]) Resulting in:
Also, the unit testing seemed to fail, but can resolve those shortly. |
Thanks, I saw that too. |
Ok, for one, you need to replace
with
Did it work always ? Because I don't know that we can even do so. URL parameters always - I believe - consisted of the whole, complete URL |
Got it, that's correct. Sorry, my transcription error. The error I had meant to post was
|
I got this error too. BUT at random, non reproducible. I have no idea how to deal with it. Plus, what does that even mean, because |
Plus it is case by case basis. Some work without the error, while others not. I will soon provide you a link (not a PR), which will get some things better (but not this): Maybe something like https://stackoverflow.com/questions/19641808/r-error-in-connection-to-host |
Ok, this seems to be the offending line. I'm submitting a couple of corrections to dev after I check out a few more things and write a couple of tests to try to catch these errors in the future. I'm also going to remove the warnings being thrown on human-readable URL support. Since it's not being deprecated, and it's a backend adjustment that doesn't directly impact the code, just need to note it in release notes. |
Pushed changes to the repo. Took awhile to get this into build into a package, so need to keep that in order. Having trouble building the examples, so need to insert more I also had to remove the vignettes to build the package. Opened #59 to note their addition again. The new geojson feature will be great. Will need some tweaks but will be able to get into 1.7. |
@tomschenkjr ad Problem with build URL: Please, take also a look on https://cran.r-project.org/web/packages/urltools/index.html as I am more and more now with the university stuff and have less and less time to boot VM and work on Rsocrata. @ I also had to remove the vignettes to build the package. Opened #59 to note their addition again. |
PS: https://github.com/Chicago/RSocrata/blob/dev/R/returnData.R#L102 |
Merge pull request Chicago#51 from Chicago/dev Pull request for v1.6.1 Remove NEWS.md from build to avoid NOTE Build number bump and date change R 3.2.1 wants all URLs in canonical form--has been changed Fix type-o in Coveralls badge Merge remote-tracking branch 'upstream/dev' into dev split & improve 4x4 logic (tests pass) bump to 1.6.3 Add new Floatin Timestamp format for posixify + add test split validate + test validate url split test fix travis don't run, not do not run fix travis again add errorHandling function and rename getResponse to checkResponse update test else if [skip ci] add not finished test for Chicago#27, Chicago#24 [skip ci] go over sprint7 branch [skip ci] text/plain [skip ci] fix paste -> paste0 because of sep="" which I deleted delete note.md + update date [skip ci] geo readme Merge remote-tracking branch 'upstream/dev' into dev Merge remote-tracking branch 'origin/dev' into dev Conflicts: DESCRIPTION add geojson example & vignette (dependencies are suggested, later to be moved to required) should fix travis now fix again second time the same. give up give up on travis. wont work ? fix last time again tests and comments (geo too) update functions, move 4x4 to utils.R update docu + .md files add geojson support. not finished Merge pull request Chicago#53 from dmpe/dev Split functions and other smaller improvements fix leaflet example, not SP object but the list and add a new contributor fix test [skip ci] Merge pull request Chicago#56 from dmpe/dev-geojson-big bind -> rbind.fill; getMetadata; GeoJSON method Fixed version number scheme Using building numbers, x.y.z-b. The "z" should only be incremented on bug releases being planned. Human-readable URLs are not being deprecated Fixes build url under certain domain cases Re-added human-readable URL Added unit tests for broken-out URLs Removed unnecessary library listings Cleaned-up documentation Increment build info Updated help files Removed vignettes, updated documentation for style Example formatting issues, cleaned-up comments Turned on tests Examples on read.socrata taking too long, stopped their run Merge pull request #4 from Chicago/dev Dev
Merge pull request Chicago#51 from Chicago/dev Pull request for v1.6.1 Remove NEWS.md from build to avoid NOTE Build number bump and date change R 3.2.1 wants all URLs in canonical form--has been changed Fix type-o in Coveralls badge Merge remote-tracking branch 'upstream/dev' into dev split & improve 4x4 logic (tests pass) bump to 1.6.3 Add new Floatin Timestamp format for posixify + add test split validate + test validate url split test fix travis don't run, not do not run fix travis again add errorHandling function and rename getResponse to checkResponse update test else if [skip ci] add not finished test for Chicago#27, Chicago#24 [skip ci] go over sprint7 branch [skip ci] text/plain [skip ci] fix paste -> paste0 because of sep="" which I deleted delete note.md + update date [skip ci] geo readme Merge remote-tracking branch 'upstream/dev' into dev Merge remote-tracking branch 'origin/dev' into dev Conflicts: DESCRIPTION add geojson example & vignette (dependencies are suggested, later to be moved to required) should fix travis now fix again second time the same. give up give up on travis. wont work ? fix last time again tests and comments (geo too) update functions, move 4x4 to utils.R update docu + .md files add geojson support. not finished Merge pull request Chicago#53 from dmpe/dev Split functions and other smaller improvements fix leaflet example, not SP object but the list and add a new contributor fix test [skip ci] Merge pull request Chicago#56 from dmpe/dev-geojson-big bind -> rbind.fill; getMetadata; GeoJSON method Fixed version number scheme Using building numbers, x.y.z-b. The "z" should only be incremented on bug releases being planned. Human-readable URLs are not being deprecated Fixes build url under certain domain cases Re-added human-readable URL Added unit tests for broken-out URLs Removed unnecessary library listings Cleaned-up documentation Increment build info Updated help files Removed vignettes, updated documentation for style Example formatting issues, cleaned-up comments Turned on tests Examples on read.socrata taking too long, stopped their run Merge pull request #4 from Chicago/dev Dev
Merge pull request Chicago#51 from Chicago/dev Pull request for v1.6.1 Remove NEWS.md from build to avoid NOTE Build number bump and date change R 3.2.1 wants all URLs in canonical form--has been changed Fix type-o in Coveralls badge Merge remote-tracking branch 'upstream/dev' into dev split & improve 4x4 logic (tests pass) bump to 1.6.3 Add new Floatin Timestamp format for posixify + add test split validate + test validate url split test fix travis don't run, not do not run fix travis again add errorHandling function and rename getResponse to checkResponse update test else if [skip ci] add not finished test for Chicago#27, Chicago#24 [skip ci] go over sprint7 branch [skip ci] text/plain [skip ci] fix paste -> paste0 because of sep="" which I deleted delete note.md + update date [skip ci] geo readme Merge remote-tracking branch 'upstream/dev' into dev Merge remote-tracking branch 'origin/dev' into dev Conflicts: DESCRIPTION add geojson example & vignette (dependencies are suggested, later to be moved to required) should fix travis now fix again second time the same. give up give up on travis. wont work ? fix last time again tests and comments (geo too) update functions, move 4x4 to utils.R update docu + .md files add geojson support. not finished Merge pull request Chicago#53 from dmpe/dev Split functions and other smaller improvements fix leaflet example, not SP object but the list and add a new contributor fix test [skip ci] Merge pull request Chicago#56 from dmpe/dev-geojson-big bind -> rbind.fill; getMetadata; GeoJSON method Fixed version number scheme Using building numbers, x.y.z-b. The "z" should only be incremented on bug releases being planned. Human-readable URLs are not being deprecated Fixes build url under certain domain cases Re-added human-readable URL Added unit tests for broken-out URLs Removed unnecessary library listings Cleaned-up documentation Increment build info Updated help files Removed vignettes, updated documentation for style Example formatting issues, cleaned-up comments Turned on tests Examples on read.socrata taking too long, stopped their run Merge pull request #4 from Chicago/dev Dev
Functions to speed up the row binding
So, to begin with, there are 4 (to me) known functions which can be used to bind rows together. These are:
rbind
frombase
This is is known to be slow with large datasets. For example, see this: http://rpubs.com/wush978/6302
rbind.fill
fromplyr
A faster version of
rbind
, implemented in R. And the only other function which can be used for our purpose.bind_rows
fromdplyr
: This is a faster (C-implemented) version ofrbind.fill
fromplyr
. The only significant change for us is the fact that it is far more stricter in terms of types which can be permitted in the column. While plyr "doesn't seem to care much" and will "just do it",dplyr
will check it and throw an error if there is e.g. an integer and character in the same column. There is already a bug filled, which has been closed Feature request - rbind_all type coercion: 'Just do it' option? tidyverse/dplyr#1162 with hadley sayingI'm currently happy with the behaviour of bind_rows(), and if you'd prefer the rbindlist() behaviour, you're free to use that.
Thus, as we cannot really on it (and from my own experience working with dplyr), this package would be out of touch.
rbindlist
fromdata.table
:I'm hesitant of data.table. I'm not a huge fan of their slightly off-kilter data frame. data.table is fast, but I worry they just aren't going to get a critical mass of users to make it well-known to many R programmers.
Yes, I fully agree on this and this is also a reason why i have decided not to apply it here.
If we have decided to do it, then we would make it in part backwards incompatible (and would be required to run a marketing campaign to our users :) )Tests
I have commited the benchmarking file (Rmd) here without appropriate functions because otherwise it would be a huge mess. Here is a gist for them: https://gist.github.com/dmpe/5aec87f0c7a5ae2115ca
dplyr
because it will print an error, with the second one there is no such a problem. This just confirms the issue withdplyr
I have described above. Again, with the third one below,dplyr
will fail again.datatable
; 2.plyr
; 3 .base
; 4.dplyr
(? what ?)datatable
; 2.dplyr
; 3 .plyr
; 4.base
(This already makes sense)dplyr
is complicated anddata.table
is a no-go, the only options is here either to useplyr
or stick withbase
rbind
. Due to the small (and with larger datasets bigger) speed up, I decided to chooseplyr
. But this can be - of course - further discussed. BTW, I also think and actually saw that it will be largely a dataset dependent, i.e. with some smaller datasetsrbind
would be faster, while with some otherplyr
would be better.Third one:
This will fail at dplyr again; without it: 1.
datatable
; 2.plyr
; 3.base
Detecting file output (geojson, csv, json)
Excellent! I wonder, though, if this should be a different method, we detect it using the extension, or use an option within read.socrata. For instance, here are alternative scenarios:
Yes, I though about that when creating this patch set.
The reason why I have decided for splitting GEO method was, that the method
read.socrata
would be otherwise really huge.Given your
geojson
choice, the methodgeojson_read
fromgeojsonio
package offers several arguments such asmethod
,parse
,what
and these must be accessible for the user. Thus you would do something likeread.socrata <- function(url = NULL, app_token = NULL, limit = 50000, domain = NULL, fourByFour = NULL, query = NULL, offset = 0, output = c("csv", "json", "geojson"), ...) {}
where ... would be used for
geojson_read
arguments.OR
read.socrata <- function(url = NULL, app_token = NULL, limit = 50000, domain = NULL, fourByFour = NULL, query = NULL, offset = 0, output = c("csv", "json", "geojson"), method = "something", parse = "something", what="something") {}
However, the last 3 will apply only to the "geojson" choice. One would document that in the documentation, but just looking at such function I would be confused.
The current read.socrataGEO method is small and very clear to the user as it deals only with the GEO stuff. While the other one, is used for CSV & JSON stuff exclusively.
But, look, I agree that this is largely a personal preference and nothing stops us in doing what Tom has suggested. This is also a reason why I wanted some feedback on my changes.
Changed behaviour
httr::GET()
request in theerrorHandling()
method, so we can speare dealing with$$token=
somewhere in the code. Thus, tests have been edited for this change