add WhereDataFrame #2467

matthieugomez · 2020-10-03T23:31:56Z

This pull request fleshes out a WhereDataFrame type, constructed using where.

filter/filter!, delete!, transform/transform!, combine, and view operate on WhereDataFrames.

It is similar in spirit to SQL where, data.table i, or Stata if.
This PR would solve #2354, #2323, and #2211.

bkamins · 2020-10-04T06:30:35Z

src/wheredataframe/wheredataframe.jl

+    WhereDataFrame{<:AbstractDataFrame,<:AbstractIndex}
+
+    The result of a [`where`](@ref) operation on an `AbstractDataFrame`; a
+    subset of a `AbstractDataFrame`


Can you please highlight that this is a wrapper (as opposed to SubDataFrame, which is an AbstractDataFrame)?

Yes. Just to be clear, this draft is just a way to start a discussion about this potential syntax — I figured it may be interesting for people to try it out a bit.

Understood - I highlighted in the comments the major design considerations I see with this proposal.

bkamins · 2020-10-04T06:32:49Z

src/wheredataframe/wheredataframe.jl

+`args...` obey the same syntax as `select(d, args...)`
+Rows that return missing are understood as false
+
+- `filter`/`filter!` returns an AbstractDataFrame after filtering (resp. deleting) specified rows


Please add that this is a list of functions that support the where wrapper. In general, in the long term almost all functions we define could support it, I think (but we can make this change gradually).

In particular, you currently omit select/select! and e.g. insertcols!.

bkamins · 2020-10-04T06:34:08Z

src/wheredataframe/wheredataframe.jl

+```
+"""
+function where(df::AbstractDataFrame, args...)
+    dfr = select(df, args...)


probably copycols=false should be passed here

also check the case when dfr has 0 columns
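
A minimal sketch of what the `copycols=false` suggestion changes, using a toy data frame that is made up for illustration and is not part of the PR. With `copycols=false`, `select` can reuse the source column vectors for plain column selections instead of copying them. The last lines illustrate the zero-column case raised above:

```julia
using DataFrames

# Toy data, invented for this illustration.
df = DataFrame(a = [1, 2, 3], b = [true, false, true])

copied = select(df, :b)                   # default copycols=true: a fresh vector
shared = select(df, :b; copycols=false)   # reuses the vector stored in df

copied.b === df.b   # false
shared.b === df.b   # true

# Selecting no columns at all gives a result with zero columns,
# which a `where` implementation would need to handle explicitly.
no_cols = select(df, Symbol[])
ncol(no_cols)       # 0
```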

bkamins · 2020-10-04T06:38:11Z

src/wheredataframe/wheredataframe.jl

+Base.summary(io::IO, wdf::WhereDataFrame) = summary(wdf)
+
+
+function Base.show(io::IO,


All MIME types should be supported (in particular HTML and LaTeX). Also, I would make sure that we very clearly visually indicate that this is NOT an AbstractDataFrame.

Finally, the standard show has to be updated to avoid a StackOverflowError when displaying a data frame that contains a WhereDataFrame in a circular reference.

Also, in general, I would wait to merge this PR until the PrettyTables.jl integration is done and then update it accordingly; cf. #2429.

bkamins · 2020-10-04T06:40:07Z

src/wheredataframe/wheredataframe.jl

+##
+##############################################################################
+
+Base.filter(wdf::WhereDataFrame) = parent(wdf)[rows(wdf), :]


we should support view kwarg in filter.
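
For reference, a small sketch of how the `view` keyword behaves on `filter` for an ordinary data frame, in DataFrames.jl releases that provide this keyword. The toy data is invented, and how a WhereDataFrame method would forward the keyword is not settled in this thread:

```julia
using DataFrames

# Toy data, invented for this illustration.
df = DataFrame(a = [1, 2, 3, 4])

eager = filter(:a => >(2), df)              # copies the selected rows
lazy  = filter(:a => >(2), df; view=true)   # returns a view into df

eager isa DataFrame      # true
lazy isa SubDataFrame    # true
parent(lazy) === df      # true
```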

bkamins · 2020-10-04T06:43:10Z

src/wheredataframe/wheredataframe.jl

+transform(wdf::WhereDataFrame, args...; copycols::Bool=true, renamecols::Bool=true) =
+    manipulate(wdf, :, args..., copycols=copycols, renamecols=renamecols)
+
+function manipulate(wdf::WhereDataFrame, cs...; copycols::Bool, renamecols::Bool)


This should wait until #2461 is merged, in case there are some changes to the logic at this level.

(By "this" I mean all the functions that do low-level transform and transform!.)

bkamins · 2020-10-04T06:46:28Z

This would also solve #2465.

bkamins · 2020-10-04T06:47:38Z

This would not solve #2211, but I think #2211 could be closed - as the case would be handled via transform! or insertcols! - right?

bkamins · 2020-10-04T07:00:19Z

Thank you for submitting this PR (it is really great to have you contribute).

The approach you propose is a valid alternative to making SubDataFrame on Index I think. It is less flexible, but I can see its benefits - which are consequences of lower flexibility. The crucial thing is to decide on the list of functions that should support WhereDataFrame (and make sure this design is easily understood by the users). Let us see what @nalimilan and @pdeffebach think (especially with relation to DataFramesMeta.jl).

For me, the big point that I am not sure will work nicely with the proposed design is how you want to support groupby here. Do you want to define where for GroupedDataFrame similarly, or pass a where wrapper of a data frame to groupby, or not support it at all (I would assume you want to support it)?

(also at some point please remember to update the manual, the docstrings and NEWS.md - but this should be done at the end when we have design settled)

On a development side let me make a small comment (as I am not sure you are aware of this - and with large PRs this can easily become a problem). Each push to GitHub triggers CI. For DataFrames.jl CI takes over 1h for each push. CI is shared across JuliaData. This means that doing a lot of pushes to DataFrames.jl easily stalls CI for all JuliaData projects for e.g. one day. Therefore, for DataFrames.jl, it is strongly encouraged to limit the number of pushes as much as possible (i.e. preferably there should be one push from local repo to GitHub when you are happy with what you have and when you get reviewing suggestions - they should be put in a batch and committed together).

bkamins · 2020-10-04T07:05:28Z

src/wheredataframe/wheredataframe.jl

+combine(wdf::WhereDataFrame, args...; kwargs...) = combine(view(wdf), args...; kwargs...)
+DataFrame(wdf::WhereDataFrame; copycols::Bool=true) = DataFrame(view(wdf); copycols = copycols)
+DataAPI.describe(wdf::WhereDataFrame, args...; kwargs...) = describe(view(wdf), args...; kwargs...)


Since where is a wrapper (so WhereDataFrame is not an AbstractDataFrame), I think it would be better to avoid these methods and require view to be called by the user explicitly. Not sure, but this would better highlight that where is a lightweight wrapper only.

The point is that e.g. transform(wdf) and combine(wdf, :) will return a completely different result with your proposed design and I think it would be better to avoid this. Intuitively I would support only these methods for WhereDataFrame where its behavior differs from SubDataFrame and for other methods require to call view on it. Then the message to the user would be consistent.

bkamins · 2020-10-04T09:10:33Z

Just to add what I just commented to @nalimilan. If we accept this design (which is one of the two options - the other is to make SubDataFrame more flexible) then I think WhereDataFrame should be designed to highlight that it should be a short-lived object not used stand-alone (so maybe it even should not have show implemented). So where(df, ...) would be a kind-of modifier like ByRow or AsTable in a selected list of functions (maybe it should be Where then not where and the object should be called Where, as it is not an AbstractDataFrame, so WhereDataFrame is actually a confusing name).

The point is that Where(df, ...) is a syntax meant to modify how a limited number of functions work (as opposed to SubDataFrame, which is a fully operational AbstractDataFrame). This is a key decision to make: do we prefer a modifier syntax (Where) or adding functionality to SubDataFrame?

Seeing your PR, I actually think that maybe Where would be a better approach, but it requires very careful thought:

- which functions should support it, and how
- is it convenient to add such functionality to these functions (i.e. will the internal design become overly complex and hard to maintain)
- are we happy with what we get as a result (i.e. can all operations that we think are common and important be conveniently handled using Where)

nalimilan · 2020-10-04T10:14:24Z

Thanks for making a PR. Though I'd prefer to settle the discussion on the best design before diving into the implementation. Otherwise our time may be wasted and/or we may not consider all available options because we are too focused on a particular approach.

More specifically, my concern with a lazy where is that in general in Julia and in DataFrames functions are eager by default, and we wouldn't have an exact equivalent to where which wouldn't return a view. This could be confusing as filtering is a very common operation, and currently filter isn't perfectly convenient as you noted at #2323 and #2211. If we added a lazy where, then we still wouldn't have an eager variant with convenient features (like skipping missings and selecting rows inside GroupedDataFrame) -- and I don't like filter(where(...)) as a replacement as it seems redundant. So I think we should decide first whether we want an eager where, a lazy where as in this PR, or both (with different names), and how they should work.

bkamins · 2020-10-04T10:45:51Z

Some more comments to keep in mind:

- SubDataFrame does not remember its source, only its true parent (this is something that makes Where preferable over SubDataFrame)
- still, we would have to decide how where works with GroupedDataFrame (should it be where(gdf, ...) or groupby(where(df, ...), ...) in particular)
- per @nalimilan's comment, I think mixing lazy and non-lazy variants in one function (differentiated by kwarg) is not desirable, as it would lead to transform(where(df, ..., lazy=true), ...) and transform(where(df, ..., lazy=false), ...) producing significantly different results
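
The first point can be illustrated with a short sketch on invented toy data: a view taken of another view reports the original data frame as its parent, so the intermediate view is not recoverable from it.

```julia
using DataFrames

# Toy data, invented for this illustration.
df = DataFrame(a = [1, 2, 3, 4, 5])

sdf1 = view(df, 2:4, :)     # rows 2, 3, 4 of df
sdf2 = view(sdf1, 1:2, :)   # first two rows of sdf1, i.e. rows 2 and 3 of df

parent(sdf1) === df   # true
parent(sdf2) === df   # true: sdf2 points at df, not at sdf1
sdf2.a == [2, 3]      # true
```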

pdeffebach · 2020-10-04T13:40:30Z

@matthieugomez thanks for this. I'm glad someone finally went and implemented it.

A few "big picture" questions

This can cause anti-patterns where people port their Stata workflow to Julia. In Stata, since you can only have one data set in RAM at a time, people often store multiple versions of their data in the same data set and work with them separately with if statements.

I don't think any of us want people using where when they should use a new data frame altogether. Can you give some examples of use-cases that aren't this?

In general, I would say this is no substitute for making skipping values easier. If you have missing values in :long_column_name and want to, for example, normalize by mean and SD, your options are (in DataFramesMeta)

@pipe df |>
    transform(_, :long_column_name = (:long_column_name .- mean(skipmissing(:long_column_name))) ./ std(skipmissing(:long_column_name)))

or

@pipe df |>
    where(_, .!ismissing.(:long_column_name)) |>
    transform(_, :long_column_name = (:long_column_name .- mean(:long_column_name)) ./ std(:long_column_name))

Either way that's a lot of typing. Given that we still want, in the future, an easier way to skip missing values, is this feature still worth it?

Basically, if we take away an anti-pattern and the benefits of skipping missing values, what are the remaining benefits of this PR?
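
For comparison, a runnable sketch of the same normalization written with plain DataFrames.jl `transform` and `skipmissing`. The toy data and the helper name `standardize` are hypothetical and chosen only for illustration:

```julia
using DataFrames, Statistics

# Toy data, invented for this illustration.
df = DataFrame(long_column_name = [1.0, missing, 3.0, 5.0])

# Hypothetical helper: center and scale, ignoring missing values
# when computing the mean and the standard deviation.
standardize(x) = (x .- mean(skipmissing(x))) ./ std(skipmissing(x))

out = transform(df, :long_column_name => standardize => :long_column_name)

# The non-missing values 1, 3, 5 have mean 3 and standard deviation 2,
# so the column becomes [-1.0, missing, 0.0, 1.0].
out.long_column_name
```

Rows with missing values stay in the result, which is the behavior the first variant in the comment above aims for.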

matthieugomez · 2020-10-04T22:40:28Z

It is probably too complex. I think a simpler where that simply creates a dataframe (or a view with view = true) would be easier to reason about. While it does not solve the problem of updating a variable if a condition is satisfied, there could be an update verb down the line to do so (I don’t think using views for that is a good idea).

pdeffebach · 2020-10-04T22:48:31Z

I will start experimenting with an @if macro in DataFramesMeta and see what I can come up with.

pdeffebach · 2021-06-08T17:23:49Z

I no longer think the Stata-esque dataset-in-a-dataset is such a bad anti-pattern.

I wonder if we can implement this by having a mutable = true flag for a SubDataFrame? It should behave the same in all instances, except setindex!.

bkamins · 2021-06-08T17:42:17Z

Since this PR is closed, can you open an issue for this so it is easier to track?

I think a flag could be OK (we could even allow for mutation of this flag post creation). But it requires a careful consideration of the design (maybe repeating what we already discussed to refresh memory and include recent experiences). Still I would not allow it by default I think (as views should be immutable in general)

In general - the more I think about this issue the more I feel we need to resolve the where issue in a nice way (i.e. something that will be user friendly in most of the cases even at the expense of potential performance hiccups or some inconsistency in corner cases - which would be e.g. disallowed)

bkamins · 2021-06-08T19:13:54Z

Before you open a new issue, let me write down my thoughts here.

I have reviewed the implementation and it should be possible to implement it without much redesign. We might even consider adding this feature by default (i.e. without requiring mutable=true). The only caveat is that mutating via a view might invalidate other views, but we already allow this in select! and transform! for GroupedDataFrame so maybe it is OK (we just need to make sure to clearly signal the risks in the documentation).

The key thing to think of is how to design:

- setindex! (easy)
- broadcasting assignment
- select! (the question is if we need to support select! as it potentially drops columns)
- transform!

I assume that what we should allow is ADDING new columns only, and not replacing old columns (setindex! for old columns already works and has a defined API - in particular it does not allow changing of column eltype)
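
A short sketch of the existing behavior referred to here, on invented toy data: assigning to an existing column through a view writes into the parent in place, and a value that does not fit the column's element type is rejected.

```julia
using DataFrames

# Toy data, invented for this illustration.
df = DataFrame(a = [1, 2, 3, 4], b = [10, 20, 30, 40])

sdf = view(df, df.a .> 2, :)   # rows where a > 2
sdf[:, :b] = [0, 0]            # in-place update of the selected rows of df.b

df.b == [10, 20, 0, 0]         # true: the parent was modified

# The parent column holds Int values, so `missing` cannot be stored in it.
threw = try
    sdf[:, :b] = [missing, missing]
    false
catch
    true
end
threw                          # true
```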

bkamins · 2021-06-08T20:16:25Z

Actually maybe we could allow replacing columns - we need to list all possible operations in the new issue and decide which should work and how.

pdeffebach · 2021-06-08T20:22:17Z

Fwiw, Stata doesn't allow replacing columns, as in you can't set everything else to missing and you can't change the type.

But we should look into this more seriously in another issue, and document Stata's behavior.

bkamins · 2021-06-08T20:30:47Z

The same is done by data.table (i.e. you cannot set missing to an existing column in filtered-out rows and column eltype must match). As I have said - the question is if we want to allow it. The issue is that we have df.a, df[!, :a] and df[:, :a] at our disposal - this is the flexibility that other ecosystems do not have.
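
The three access forms mentioned above differ in whether they copy and whether assignment replaces the column. A minimal sketch on invented toy data:

```julia
using DataFrames

# Toy data, invented for this illustration.
df = DataFrame(a = [1, 2, 3])

df.a === df[!, :a]    # true: both return the stored vector without copying
df[:, :a] === df.a    # false: `:` as row selector returns a copy
df[:, :a] == df.a     # true: same values

# Assignment with `:` updates the existing vector in place.
stored = df.a
df[:, :a] = [4, 5, 6]
df.a === stored       # true: same vector, new values

# Assignment with `!` replaces the column, so the element type may change.
df[!, :a] = [1.5, 2.5, 3.5]
eltype(df.a)          # Float64
```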

matthieugomez marked this pull request as draft October 4, 2020 00:16

matthieugomez added 5 commits October 3, 2020 21:01

add WhereDataFrame

2d3f55c

Update wheredataframe.jl

de04ad2

Update wheredataframe.jl

1805392

Update DataFrames.jl

25b1bc6

Update wheredataframe.jl

5367b46

bkamins reviewed Oct 4, 2020

bkamins added decision feature grouping question labels Oct 4, 2020

bkamins added this to the 1.x milestone Oct 4, 2020

matthieugomez closed this Oct 4, 2020

nalimilan mentioned this pull request Jun 10, 2021

Assignment to SubDataFrame #2785

Closed
