Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Comparison with dplyr and Stata #2329

Merged
merged 38 commits into from
Aug 22, 2020
Merged

Comparison with dplyr and Stata #2329

merged 38 commits into from
Aug 22, 2020

Conversation

matthieugomez
Copy link
Contributor

@matthieugomez matthieugomez commented Jul 24, 2020

This adds a very simple table to compare the main functions in DataFrames with dplyr and Stata.
I focused on the main functions to make it as simple as possible.

@bkamins bkamins added the doc label Jul 24, 2020
@bkamins bkamins added this to the 1.0 milestone Jul 24, 2020
@bkamins
Copy link
Member

bkamins commented Jul 24, 2020

Thank you for working on it. I have left some comments.

Copy link
Contributor

@pdeffebach pdeffebach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this! It's super helpful.

I think we could make this a bit more biased towards DataFrames. For instance, working with variables rather than literals in a comparison between DataFrames and dplyr. Similarly, we should show off describe more.

@matthieugomez
Copy link
Contributor Author

I've answered the comments and added two examples of functions that return dataframes rather than vectors.

@bkamins
Copy link
Member

bkamins commented Jul 24, 2020

Thank you for the fixes. Let us wait for @nalimilan to have a look at the PR.

matthieugomez and others added 2 commits August 4, 2020 06:45
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
Copy link
Contributor

@pdeffebach pdeffebach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

In general the html output is a bit awkward. Lots of lines are cut off. Bit I'm not sure what we can do about that.

matthieugomez and others added 5 commits August 4, 2020 08:26
Co-authored-by: pdeffebach <23196228+pdeffebach@users.noreply.github.com>
Co-authored-by: pdeffebach <23196228+pdeffebach@users.noreply.github.com>
@bkamins
Copy link
Member

bkamins commented Aug 4, 2020

Maybe we should switch from a table to a list of lists? Not sure what would look better though

@matthieugomez
Copy link
Contributor Author

matthieugomez commented Aug 4, 2020

We could split the comparison into a dplyr and stata one, which would make each table less wide. Moreover, in reality people tend to be interested in either comparison, not both at the same time.

@bkamins
Copy link
Member

bkamins commented Aug 4, 2020

Good point - agreed. It will create some duplication but I do not think it is a problem (we can make separate sections for them).

@matthieugomez
Copy link
Contributor Author

matthieugomez commented Aug 4, 2020

Done. This led me to add a Stata specific operation: transform certain rows. Makes me think that it would be nice to define some kind of replace signature to make it easier.

@bkamins
Copy link
Member

bkamins commented Aug 4, 2020

Makes me think that it would be nice to define some kind of replace signature to make it easier.

Indeed, especially as ifelse.(y .>= 1, 1, x)) will fail with missing values (our usual scenario). Can you open an Issue please with what you think would be good (or just a call to add it so that we keep track of it - I have a remote memory that we already discussed it, but it might have been lost so it is better to have an issue for it, as it is a common operation)

@pdeffebach
Copy link
Contributor

I think passmissing(ifelse).(...) is good enough to include here.

@bkamins
Copy link
Member

bkamins commented Aug 4, 2020

Unfortunately it is incorrect. You have to write ifelse.(.!ismissing.(y) .& y .>= 1, 1, x).

Using passmissing(ifelse) will produce missing in the result instead of retaining the value stored in x.

@bkamins
Copy link
Member

bkamins commented Aug 4, 2020

This is the reason that I am on board with efforts to make handling missing easier (but I am too busy with other PRs now to sit down to it seriously). What were the latest ideas to solve this issue (it was a macro if I recall correctly - right?).

or in the original code:

ifelse.(coalesce.(y .>= 1, false), 1, x)

which feels more idiomatic.

@matthieugomez
Copy link
Contributor Author

I think it’s #2211

@bkamins
Copy link
Member

bkamins commented Aug 13, 2020

@nalimilan - ok to merge?

@bkamins
Copy link
Member

bkamins commented Aug 20, 2020

@nalimilan - I would merge it if you do not have additional comments.

@bkamins
Copy link
Member

bkamins commented Aug 21, 2020

So - is this good to merge?

@nalimilan
Copy link
Member

Just waiting for Travis... :-p

@nalimilan nalimilan merged commit 6ea3070 into JuliaData:master Aug 22, 2020
@nalimilan
Copy link
Member

Thanks!

@nalimilan
Copy link
Member

The new page doesn't show up in the manual. I think you need to add it to the list too. Also there are a few issues with some tables, see https://juliadata.github.io/DataFrames.jl/latest/man/comparisons/

||`combine(df, names(df, r"^x") .=> mean)`|`collapse (mean) x*`|
||`combine(df, ([:x, :y] .=> [maximum minimum])...)`|`collapse (max) x y (min) x y`|
|Multivariate function|`transform!(df, [:x, :y] => cor => :z)`|`egen z = corr(x y)`|
|Row-wise|`transform!(df, [:x, :y] => ByRow(min) => :z)`|`egen z = rowmin(x y)`|
Copy link
Member

@nalimilan nalimilan Aug 26, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it would be useful to add an example of a simple row-wise function using gen in Stata? For example it's not obvious that ByRow is also useful for things like :x => ByRow(x -> x^2) => :x².

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants