add WhereDataFrame #2467

Closed
wants to merge 5 commits into from

matthieugomez
Contributor

@matthieugomez matthieugomez commented Oct 3, 2020

This pull request fleshes out a WhereDataFrame type, constructed using where.

filter/filter!, delete!, transform/transform!, combine, view operate on WhereDataFrames

It is similar in spirit to SQL where, data.table i, or Stata if.
This PR would solve #2354, #2323, and #2211.

@matthieugomez matthieugomez marked this pull request as draft October 4, 2020 00:16
WhereDataFrame{<:AbstractDataFrame,<:AbstractIndex}

The result of a [`where`](@ref) operation on an `AbstractDataFrame`; a
subset of an `AbstractDataFrame`
Member

can you please highlight that this is a wrapper (as opposed to SubDataFrame which is AbstractDataFrame?)

Contributor Author

Yes. Just to be clear, this draft is just a way to start a discussion about this potential syntax — I figured it may be interesting for people to try it out a bit.

Member

Understood - I highlighted in the comments the major design considerations I see with this proposal.

`args...` obey the same syntax as `select(d, args...)`
Rows that return missing are understood as false

- `filter`/`filter!` returns an AbstractDataFrame after filtering (resp. deleting) specified rows
Member

please add that this is a list of functions that support where wrapper. In general - in the long term almost all functions we define could support it I think (but we can make this change gradually)

Member

in particular you omit select/select! now and e.g. insertcols!.

```
"""
function where(df::AbstractDataFrame, args...)
dfr = select(df, args...)
Member

probably copycols=false should be passed here

Member

also check the case when dfr has 0 columns

Base.summary(io::IO, wdf::WhereDataFrame) = summary(wdf)


function Base.show(io::IO,
Member

All mime types should be supported (in particular HTML and LaTeX). Also I would make sure that we very clearly visually indicate that this is NOT an AbstractDataFrame.

Finally - standard show has to be updated to avoid StackOverflow when trying to display a data frame with WhereDataFrame in a circular reference.

Member

Also, in general, I would wait to merge this PR until the PrettyTables.jl integration is done and then update it accordingly; cf. #2429.

##
##############################################################################

Base.filter(wdf::WhereDataFrame) = parent(wdf)[rows(wdf), :]
Member

we should support view kwarg in filter.

transform(wdf::WhereDataFrame, args...; copycols::Bool=true, renamecols::Bool=true) =
manipulate(wdf, :, args..., copycols=copycols, renamecols=renamecols)

function manipulate(wdf::WhereDataFrame, cs...; copycols::Bool, renamecols::Bool)
Member

bkamins commented Oct 4, 2020

this should wait till #2461 is merged - in case there are some changes to the logic on this level.

(by "this" - I mean all the functions that do low-level transform and transform!)

@bkamins
Member

bkamins commented Oct 4, 2020

This would also solve #2465.

@bkamins
Member

bkamins commented Oct 4, 2020

This would not solve #2211, but I think #2211 could be closed - as the case would be handled via transform! or insertcols! - right?

@bkamins
Member

bkamins commented Oct 4, 2020

Thank you for submitting this PR (it is really great to have you contribute).

The approach you propose is a valid alternative to making SubDataFrame on Index I think. It is less flexible, but I can see its benefits - which are consequences of lower flexibility. The crucial thing is to decide on the list of functions that should support WhereDataFrame (and make sure this design is easily understood by the users). Let us see what @nalimilan and @pdeffebach think (especially with relation to DataFramesMeta.jl).

For me - the big point that I am not sure will work nicely with the proposed design - is how do you want to support groupby here? Do you want to define where for GroupedDataFrame similarly, or you want to pass where wrapper of a data frame to groupby, or you do not want to support it at all (I would assume you want to support it)?

(also at some point please remember to update the manual, the docstrings and NEWS.md - but this should be done at the end when we have design settled)

On a development side let me make a small comment (as I am not sure you are aware of this - and with large PRs this can easily become a problem). Each push to GitHub triggers CI. For DataFrames.jl CI takes over 1h for each push. CI is shared across JuliaData. This means that doing a lot of pushes to DataFrames.jl easily stalls CI for all JuliaData projects for e.g. one day. Therefore, for DataFrames.jl, it is strongly encouraged to limit the number of pushes as much as possible (i.e. preferably there should be one push from local repo to GitHub when you are happy with what you have and when you get reviewing suggestions - they should be put in a batch and committed together).

Comment on lines +139 to +141
combine(wdf::WhereDataFrame, args...; kwargs...) = combine(view(wdf), args...; kwargs...)
DataFrame(wdf::WhereDataFrame; copycols::Bool=true) = DataFrame(view(wdf); copycols = copycols)
DataAPI.describe(wdf::WhereDataFrame, args...; kwargs...) = describe(view(wdf), args...; kwargs...)
Member

Since where is a wrapper (so WhereDataFrame is not an AbstractDataFrame), I think it would be better to avoid these methods and require view to be called by the user explicitly. Not sure - but this would better highlight that where is a lightweight wrapper only.

The point is that e.g. transform(wdf) and combine(wdf, :) will return a completely different result with your proposed design and I think it would be better to avoid this. Intuitively I would support only these methods for WhereDataFrame where its behavior differs from SubDataFrame and for other methods require to call view on it. Then the message to the user would be consistent.

@bkamins
Member

bkamins commented Oct 4, 2020

Just to add what I just commented to @nalimilan. If we accept this design (which is one of the two options - the other is to make SubDataFrame more flexible) then I think WhereDataFrame should be designed to highlight that it should be a short-lived object not used stand-alone (so maybe it even should not have show implemented). So where(df, ...) would be a kind-of modifier like ByRow or AsTable in a selected list of functions (maybe it should be Where then not where and the object should be called Where, as it is not an AbstractDataFrame, so WhereDataFrame is actually a confusing name).

The point is that Where(df, ...) is a syntax meant to modify how a limited number of functions work (as opposed to SubDataFrame, which is a fully operational AbstractDataFrame). This is a key decision to make - do we prefer a modifier syntax (Where) or adding functionality to SubDataFrame?

Seeing your PR I actually think that maybe Where would actually be a better approach, but it requires a very careful thought:

  • which functions and how should support it
  • is it convenient to add such functionality to these functions (i.e. if the internal design will not become overly complex and hard to maintain)
  • are we happy with what we get as a result (i.e. if all operations that we think are common and important can be conveniently handled using Where)

@nalimilan
Member

Thanks for making a PR. Though I'd prefer to settle the discussion on the best design before diving into the implementation. Otherwise our time may be wasted and/or we may not consider all available options because we are too focused on a particular approach.

More specifically, my concern with a lazy where is that in general in Julia and in DataFrames functions are eager by default, and we wouldn't have an exact equivalent to where which wouldn't return a view. This could be confusing as filtering is a very common operation, and currently filter isn't perfectly convenient as you noted at #2323 and #2211. If we added a lazy where, then we still wouldn't have an eager variant with convenient features (like skipping missings and selecting rows inside GroupedDataFrame) -- and I don't like filter(where(...)) as a replacement as it seems redundant. So I think we should decide first whether we want an eager where, a lazy where as in this PR, or both (with different names), and how they should work.
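For reference, the eager/lazy distinction above can be sketched with indexing that already exists in DataFrames.jl. This is only an illustration, not the `where` proposed in this PR: it assumes that `coalesce.(cond, false)` is an acceptable stand-in for "rows that return missing are understood as false", and the column name `x` is made up.

```julia
using DataFrames

df = DataFrame(x = [1, missing, 3])

# Treat rows where the condition evaluates to missing as false.
mask = coalesce.(df.x .> 1, false)

eager = df[mask, :]        # eager: copies the selected rows
lazy = view(df, mask, :)   # lazy: a SubDataFrame backed by df

lazy[1, :x] = 10           # writes through to the parent data frame
```

After the last line the parent `df` has changed while `eager` has not, which is the behavioral difference an eager and a lazy variant would expose to users.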

@bkamins
Member

bkamins commented Oct 4, 2020

Some more comments to keep in mind:

  • SubDataFrame does not remember its source, only its true parent (this is something that makes Where preferable over SubDataFrame)
  • still we would have to decide how where works with GroupedDataFrame (should it be where(gdf, ...) or groupby(where(df, ...), ...) in particular)
  • per @nalimilan's comment, I think mixing lazy and non-lazy variants in one function (differentiated by a kwarg) is not desirable, as it would lead to transform(where(df, ..., lazy=true), ...) and transform(where(df, ..., lazy=false), ...) producing significantly different results

@pdeffebach
Contributor

@matthieugomez thanks for this. I'm glad someone finally went and implemented it.

A few "big picture" questions

  1. This can cause anti-patterns where people port their Stata workflow to Julia. In Stata, since you can only have one dataset in RAM at a time, people often store multiple versions of their data in the same dataset and work with them separately with if statements.

I don't think any of us want people using where when they should use a new data frame altogether. Can you give some examples of use-cases that aren't this?

  2. In general, I would say this is no substitute for making skipping values easier. If you have missing values in :long_column_name and want to, for example, normalize by mean and SD, your options are (in DataFramesMeta)
@pipe df |>
    transform(_, :long_column_name = (:long_column_name .- mean(skipmissing(:long_column_name))) ./ std(skipmissing(:long_column_name)))

or

@pipe df |>
    where(_, .!ismissing.(:long_column_name)) |>
    transform(_, :long_column_name = (:long_column_name .- mean(:long_column_name)) ./ std(:long_column_name))

Either way that's a lot of typing. Given that we still want, in the future, an easier way to skip missing values, is this feature still worth it?

Basically, if we take away an anti-pattern and the benefits of skipping missing values, what are the remaining benefits of this PR?
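For comparison, the same standardization on a plain vector might look like the sketch below. It uses only Base and the Statistics standard library; `standardize` is a hypothetical helper introduced here for illustration, not part of DataFrames.jl or DataFramesMeta.

```julia
using Statistics

# Hypothetical helper: standardize a vector, ignoring missing values
# when computing the mean and the standard deviation.
standardize(v) = (v .- mean(skipmissing(v))) ./ std(skipmissing(v))

x = [1.0, missing, 3.0]
z = standardize(x)   # missing entries stay missing in the result
```

The `skipmissing` calls still have to be repeated for each statistic, which is the verbosity being discussed.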

@matthieugomez
Contributor Author

It is probably too complex. I think a simpler where that simply creates a dataframe (or a view with view = true) would be easier to reason about. While it does not solve the problem of updating a variable if a condition is satisfied, there could be an update verb down the line to do so (I don’t think using views for that is a good idea).

@pdeffebach
Contributor

I will start experimenting with an @if macro in DataFramesMeta and see what I can come up with.

@pdeffebach
Contributor

I no longer think the Stata-esque dataset-in-a-dataset is such a bad anti-pattern.

I wonder if we can implement this by having a mutable = true flag for a SubDataFrame? It should behave the same in all instances, except setindex!.

@bkamins
Member

bkamins commented Jun 8, 2021

Can you open an issue for this, since this is closed so it is easier to track?

I think a flag could be OK (we could even allow for mutation of this flag post creation). But it requires careful consideration of the design (maybe repeating what we already discussed to refresh memory and include recent experiences). Still, I would not allow it by default, I think (as views should be immutable in general).

In general - the more I think about this issue the more I feel we need to resolve the where issue in a nice way (i.e. something that will be user friendly in most of the cases even at the expense of potential performance hiccups or some inconsistency in corner cases - which would be e.g. disallowed)

@bkamins
Member

bkamins commented Jun 8, 2021

Before you open a new issue, let me write down my thoughts here.

I have reviewed the implementation and it should be possible to implement it without much redesign. We might even consider adding this feature by default (i.e. without requiring mutable=true). The only caveat is that mutating via a view might invalidate other views, but we already allow this in select! and transform! for GroupedDataFrame so maybe it is OK (we just need to make sure to clearly signal the risks in the documentation).

The key thing to think of is how to design:

  • setindex! (easy)
  • broadcasting assignment
  • select! (the question is if we need to support select! as it potentially drops columns)
  • transform!

I assume that what we should allow is ADDING new columns only, and not replacing old columns (setindex! for old columns already works and has a defined API - in particular it does not allow changing of column eltype)
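As a reminder of the existing behavior referred to above, here is a small sketch of `setindex!` and broadcasting assignment on existing columns of a `SubDataFrame`. The data and column names are made up; adding new columns through the view, which is the open design question, is deliberately not shown.

```julia
using DataFrames

df = DataFrame(x = [1, 2, 3], y = [10, 20, 30])
sdf = view(df, df.x .> 1, :)   # SubDataFrame over rows 2 and 3 of df

# In-place broadcasting assignment into an existing column of the view.
sdf[:, :y] .= 0

# The column eltype is fixed: a value that cannot be converted is rejected.
rejected = try
    sdf[1, :y] = "a"
    false
catch
    true
end
```

Only the rows selected by the view are updated in `df`, and the failed assignment leaves the column untouched.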

@bkamins
Member

bkamins commented Jun 8, 2021

Actually maybe we could allow replacing columns - we need to list all possible operations in the new issue and decide which should work and how.

@pdeffebach
Contributor

Fwiw, Stata doesn't allow replacing columns, as in you can't set everything else to missing and you can't change the type.

But we should look into this more seriously in another issue, and document Stata's behavior.

@bkamins
Member

bkamins commented Jun 8, 2021

The same is done by data.table (i.e. you cannot set missing to an existing column in filtered-out rows and column eltype must match). As I have said - the question is if we want to allow it. The issue is that we have df.a, df[!, :a] and df[:, :a] at our disposal - this is the flexibility that other ecosystems do not have.
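The three access forms mentioned here differ as follows on a plain `DataFrame`; this is a minimal sketch with a made-up column `a`, shown only to make the flexibility concrete.

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3])
col = df.a

same_object = df.a === df[!, :a]   # both return the stored column itself
is_copy = df[:, :a] !== col        # `:` row indexing returns a copy

df[:, :a] = [4, 5, 6]              # in place: updates the existing vector
df[!, :a] = [1.5, 2.5, 3.5]        # replaces the column; eltype may change
```

After these two assignments `col` still holds the in-place update, while `df.a` is a new `Float64` column. A where-style view would have to pick which of these semantics it exposes.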

4 participants