-
Notifications
You must be signed in to change notification settings - Fork 372
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
add WhereDataFrame #2467
add WhereDataFrame #2467
Conversation
WhereDataFrame{<:AbstractDataFrame,<:AbstractIndex} | ||
|
||
The result of a [`where`](@ref) operation on an `AbstractDataFrame`; a | ||
subset of a `AbstractDataFrame` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you please highlight that this is a wrapper (as opposed to SubDataFrame
which is AbstractDataFrame
?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Just to be clear, this draft is just a way to start a discussion about this potential syntax — I figured it may be interesting for people to try it out a bit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Understood - I highlighted in the comments the major design considerations I see with this proposal.
`args...` obey the same syntax as `select(d, args...)` | ||
Rows that return missing are understood as false | ||
|
||
- `filter`/`filter!` returns an AbstractDataFrame after filtering (resp. deleting) specified rows |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please add that this is a list of functions that support where
wrapper. In general - in the long term almost all functions we define could support it I think (but we can make this change gradually)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in particular you omit select
/select!
now and e.g. insertcols!
.
``` | ||
""" | ||
function where(df::AbstractDataFrame, args...) | ||
dfr = select(df, args...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably copycols=false
should be passed here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also check the case when dfr
has 0
columns
Base.summary(io::IO, wdf::WhereDataFrame) = summary(wdf) | ||
|
||
|
||
function Base.show(io::IO, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All mime types should be supported (in particular HTML and LaTeX). Also I would make sure that we very clearly visually indicate that this is NOT an AbstractDataFrame
.
Finally - standard show
has to be updated to avoid StackOverflow
when trying to display a data frame with WhereDataFrame
in a circular reference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also in general - I would wait with merging this PR after PrettyTables.jl integration is done and then update it accordingly. c.f. #2429
## | ||
############################################################################## | ||
|
||
Base.filter(wdf::WhereDataFrame) = parent(wdf)[rows(wdf), :] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should support view
kwarg in filter
.
transform(wdf::WhereDataFrame, args...; copycols::Bool=true, renamecols::Bool=true) = | ||
manipulate(wdf, :, args..., copycols=copycols, renamecols=renamecols) | ||
|
||
function manipulate(wdf::WhereDataFrame, cs...; copycols::Bool, renamecols::Bool) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should wait till #2461 is merged - in case there are some changes to the logic on this level.
(buy "this" - I mean all the functions that do low-level transform
and transform!
)
This would also solve #2465. |
Thank you for submitting this PR (it is really great to have you contribute). The approach you propose is a valid alternative to making For me - the big point that I am not sure will work nicely with the proposed design - is how do you want to support (also at some point please remember to update the manual, the docstrings and NEWS.md - but this should be done at the end when we have design settled) On a development side let me make a small comment (as I am not sure you are aware of this - and with large PRs this can easily become a problem). Each push to GitHub triggers CI. For DataFrames.jl CI takes over 1h for each push. CI is shared across JuliaData. This means that doing a lot of pushes to DataFrames.jl easily stalls CI for all JuliaData projects for e.g. one day. Therefore, for DataFrames.jl, it is strongly encouraged to limit the number of pushes as much as possible (i.e. preferably there should be one push from local repo to GitHub when you are happy with what you have and when you get reviewing suggestions - they should be put in a batch and committed together). |
combine(wdf::WhereDataFrame, args...; kwargs...) = combine(view(wdf), args...; kwargs...) | ||
DataFrame(wdf::WhereDataFrame; copycols::Bool=true) = DataFrame(view(wdf); copycols = copycols) | ||
DataAPI.describe(wdf::WhereDataFrame, args...; kwargs...) = describe(view(wdf), args...; kwargs...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since where
is a wrapper (so WhereDataFrame
is not an AbstractDataFrame
) I think I would be better to avoid these methods and require view
to be called by the user explicitly. Not sure - but this would better highlight that where
is a lightweight wrapper only.
The point is that e.g. transform(wdf)
and combine(wdf, :)
will return a completely different result with your proposed design and I think it would be better to avoid this. Intuitively I would support only these methods for WhereDataFrame
where its behavior differs from SubDataFrame
and for other methods require to call view
on it. Then the message to the user would be consistent.
Just to add what I just commented to @nalimilan. If we accept this design (which is one of the two options - the other is to make The point is that Seeing your PR I actually think that maybe
|
Thanks for making a PR. Though I'd prefer to settle the discussion on the best design before diving into the implementation. Otherwise out time may be wasted and/or we may not consider all available options because we are too focused on a particular approach. More specifically, my concern with a lazy |
Some more comments to keep in mind:
|
@matthieugomez thanks for this. I'm glad someone finally went and implemented it. A few "big picture" questions
I don't think any of us want people using
or
Either way that's a lot of typing. Given that we still want, in the future, an easier way to skip missing values, is this feature still worth it? Basically, if we take away an anti-pattern and the benefits of skipping missing values, what are the remaining benefits of this PR? |
It is probably too complex. I think a simpler |
I will start experimenting with an |
I no longer think the Stata-esque dataset-in-a-dataset is such a bad anti-pattern. I wonder if we can implement this but having a |
Can you open an issue for this, since this is closed so it is easier to track? I think a flag could be OK (we could even allow for mutation of this flag post creation). But it requires a careful consideration of the design (maybe repeating what we already discussed to refresh memory and include recent experiences). Still I would not allow it by default I think (as views should be immutable in general) In general - the more I think about this issue the more I feel we need to resolve the |
Before you opened a new issue let me write down the thoughts here. I have reviewed the implementation and it should be possible to implement it without much redesign. We might even consider adding this feature by default (i.e. without requiring The key thing to think of is how to design:
I assume that what we should allow is ADDING new columns only, and not replacing old columns ( |
Actually maybe we could allow replacing columns - we need to list all possible operations in the new issue and decide which should work and how. |
Fwiw, Stata doesn't allow replacing columns, as in you can't set everything else to But we should look into this more seriously in another issue, and document Stata's behavior. |
The same is done by data.table (i.e. you cannot set |
This pull request fleshes out a
WhereDataFrame
type, constructed usingwhere
.filter/filter!
,delete!
,transform/transform!
,combine
,view
operate onWhereDataFrames
It is similar in spirit to SQL where, data.table i, or Stata if.
PR would solve #2354, #2323, #2211