-
-
Notifications
You must be signed in to change notification settings - Fork 426
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
Initial prototype of plot_lm #1727
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At a high level this looks good, Please start adding tests for functionality so we can more fully shake out bugs
arviz/plots/lmplot.py
Outdated
if y_ppc is None: | ||
y_ppc = y_var_name | ||
|
||
y_model_values=0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would consider splitting this into another function and testing it independently
The high level design api doesnt seem to match whats in the code but i generally like the way its stated here. It would be nice if the idata was the only "data" argument, and the other arguments are specifiers for sets of data in the idata object For the model specification instead of parsing it ourselves should we use patsy instead, and perhaps add it an optional dependency? @ahartikainen for a double check here Overall the plot looks nice. For next steps I think we should
|
I'm not 100% we should parse formulas. I think everything should be in the InferenceData and we call it. For cases where we want to calculate the model mean, user given function is probably a safer option. |
I'm 90% sure we don't want to parse formulas at all, 100% sure we don't want to create any kind of parser (not only because it would be a lot of work and could even eat up the whole GSoC or even more time but also because we'd then have to maintain it!). AFAIK all formula parsing libraries we could depend on use dataframes as input which we don't have so we'd need to convert the data to a correctly formatted dataframe for things to work, and for formula parsing to be really useful we would not only need to parse the formula but also consider link functions for example. I think having a lower level thing that takes samples/functions is preferrable, and it should be easy for libraries like bambi to build on top of that and offer a formula based api to users.
You can take a look at the refitting notebooks that use xarray for examples on a very similar computation. |
Do you mean that we should accept |
Even the y_model? For example |
Just to add my 2 cents, if we dont need to parse formulas I dont think we should either. Itll add an extra dependency to ArviZ, and in the modeling workflow it will be very annoying to restate a complicated model in a formula when you've already done it in a PPL. Im not sure if we can get away without it but glad @OriolAbril and @ahartikainen chimed in saying we probably dont need it! |
Sure then! Assuming the y_model is present in inferenceData is looking a good option. |
Not completely sure what this means. Are you talking about, for now, taking all of the necessary values as references to precomputed DataArrays, stored in idata. And later, extending it to take an equation string and let bambi handle the equation. |
As suggested by everyone, what I have come up with is very simple than what I expected.
Function call :
We can then move forward to reduce input arguments by making certain assumptions(specially for the I agree on the comments made here and on slack regarding parsing the formula and totally convinced that making a parser is not a good idea. |
Looking good! Youre gonna get so sick of me saying this. Cant wait to see the function implementation and some tests so we can try it out :D |
b44ed06
to
445c9c0
Compare
I have added a basic test. Not 100% sure if it's the right approach. Actually, there was no pre-built linear regression model so created one in the test function itself. Should it go in the helper? |
Codecov Report
@@ Coverage Diff @@
## main #1727 +/- ##
==========================================
+ Coverage 90.87% 91.07% +0.20%
==========================================
Files 109 112 +3
Lines 11844 12094 +250
==========================================
+ Hits 10763 11015 +252
+ Misses 1081 1079 -2
Continue to review full report at Codecov.
|
Looks like a good start to the test. Your instinct is right I would move the "model" to helpers. I also would suggest instead of creating groups directly to use the inferencedata interface so if that changes itll be reflected in this test https://github.com/arviz-devs/arviz/blob/main/arviz/tests/helpers.py#L22 I wont have time for a full review just this moment but will get to it in next 24 to 72 hours. |
I would try to get something like:
facetting/iterating to generate the multiple subplots is going to be challenging and therefore I think we should take that into account (or a specific subset we want to support) from the beginning, so we don't define somehting for 1 subplot case and then try to force facetting on it. Regarding the group arguments, I strongly feel we should not have them, for several reasons: variables should have a group where they belong and having that seems to indicate they don't; we don't have that in any of the similar plots (i.e. the group argument in plot_ppc is posterior and prior yet we never get samples from either group there, prior means use data from prior_predictive and observed_data, same for posterior, we could have this here too, but not a group per variable); it's not difficult to move variables from one group to another (we might need to improve docs though) and it should not duplicate memory. I am also unsure about the hdi for the mean. Most people I have seen use that were doing so inadvertedly thinking it was the posterior predictive hdi. What information is that part supposed to transmit? I understand and know how to interpret spagetti plots of either pp samples directly or regression lines but I don't really know how to interpret these mean hdi ones so everything might come from my own ignorance. |
I really like this approach. Thanks for putting this.
If optional, would we be plotting just x and y?
So, then we would assume variables to be present in a certain group. If any of the variable is not found in that particular group, we should not search in other groups and return immediately with an error.? or should we search in other groups too? |
x, y and y_ppc (which we should probably call
Yes, I think it will make the code much simpler and easier to maintain with little to no drawback. You have the x in observed_data instead of constant_data? use |
According to me, It depicts the uncertainty of mean line just like spagetti plots do, but instead by plotting credible interval. We could do it like following as well. About y_ppc we can plot posterior_predictive samples in different colour than observed values. |
It is possible to plot the uncertainty three ways using an arguments? Two of the ways you've shown, either an hdi band, or samples from the posterior predictive. The third idea is a band but where the opacity changes as a function of the HDI. this last one dont implement in this PR just throwing the idea out there. I can see different situations where a user would want the first two, so being able to switch, or plot both together, with an argument would be nice. The third Im less sure about but what to suggest it now while were in the early PR phase, although I want to stress again don't implement the third one. If method is made flexible enough to switch between points and band uncertainty representation we can test the third idea later |
1a36bbd
to
ba005ec
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added couple of comments
_, ax = plt.subplots(1, 1, figsize=(6, 4), dpi=100) | ||
|
||
# plot data | ||
ax.plot(x, y, marker=".", color="C3", lw=0, zorder=10) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At some point these could be used with kwargs (with some setdefault magic)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this remains to be done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Different kwargs for observed data, PPC and model?
y_model_mean_ = y_model.stack(sample=("chain", "draw"))[..., slicer].mean("sample") | ||
ax.plot(x, y_model_mean_, color="y", lw=0.8, zorder=11) | ||
|
||
for spine in ax.spines.values(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this also needs some thinking when we want to do this and when not
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes , I have just taken it from from your function. I need to think about it because I have less idea where is it useful.
|
||
for spine in ax.spines.values(): | ||
spine.set_visible(False) | ||
ax.grid(True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should also be some kwargs
@OriolAbril I have written some code for it. Can you share a model to test this argument? |
a0fc367
to
918682b
Compare
I have added some tests. Function is now working fine for multidimensional y as well as x. Now some labelling and touch-ups are required so that graph looks nice. It gets complicated at some places in user API, which I think need some review because there may be better ways to process the inputs. Thank you for being so helpful so far. 😄 |
f081feb
to
4dba441
Compare
Overall this code looks great and I like the plots. This is fantastic add to the ArviZ toolset. Youre nearly most of the way there just some comments below. Need to fix or address commentsReturned object should always be a plotThis examples from the docs returns a numpy array. It should return a plot
Extra code documentationThere are a couple code comments that are very helpful. Add a couple more like the one noted above. nothing fancy, just one liners on major sections of logic. If youd like i can note where to help out Nitpicky Plot commentThis doesnt need to be fixed in this pr but the legend will need a box around it. Its really hard to read otherwise. I know we dont need this across all plots but I think we really should as some of our other plots suffer as well. Future PRs: Lots of DocumentationFirst off thanks for adding both a complete api documentation, code examples, and tests. It is a fantastic start. Some of the arguments are not particularly intuitive, which is not you fault just the nature of the best, so documentation will help. In a followup PR can you add docs that show the following? Perhaps start with a notebook for prototyping and we can see how complicated it is and decide the best final home for the code Things that will help are
|
Also this branch might need a rebase on main to fix the failing tests. Once we have passing tests we can assess the test coverage as well |
This one is interesting. This example is plotting multiple plots. So, it's returning an array of plots. Checked it for other plots like plot_trace, it too returns NumPy array. But it is even returning an array of shape (1,1) in case there is only one plot. It should return a plot, not an array of 1 plot. This needs some thinking, why is it happening |
Oh youre right ignore my comment, no need to return a 1d array. I apologize my mistake there |
81de065
to
f6ab7af
Compare
arviz/data/io_pyro.py:111:20: E1102: pyro.poutine.trace is not callable (not-callable) Is it an error from pyro's side? |
Yea this is a linting error from pyro. Did you rebase on master? I think all the issues are fixed on master? It was passing as of 13 days ago, not sure if this is a new issue |
Yes, I did. I think this is some new error. Other PR's are also failing. |
Master is fixed. Please rebase or merge so we can see if CI passes |
f93d9f9
to
bef54eb
Compare
All green fInally!! Good to see passed tests after a long time xD. Thanks to all who worked to fix CI! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comments don't necessarily need to be addressed in this PR, but they need to be addressed at some point.
This doesnt need to be fixed in this pr but the legend will need a box around it. Its really hard to read otherwise. I know we dont need this across all plots but I think we really should as some of our other plots suffer as well.
https://arviz-devs.github.io/arviz/examples/plot_parallel_rank.html
This may actually an issue with the arviz-darkgrid theme: https://github.com/arviz-devs/arviz/blob/main/arviz/plots/styles/arviz-darkgrid.mplstyle#L11. Do we only want this plot to have a box? Or is it something more general that should be handled by the theme?
This may also be related to the grid parameter, I have focused on docstring only and don't know how it works, but it may be there only to override the value set by the theme.
This doesn't mean that we can't override theme set values, if we think this plot should always/never have a grid/legend box independently of the theme for example, just wanted to comment that so we are all aware.
arviz/plots/lmplot.py
Outdated
y : str or DataArray or ndarray | ||
If str, variable name from observed_data | ||
idata : az.InferenceData object, Optional | ||
Optional only if y is Sequence |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sequence is not a valid type according to y
docstring, so I'm not sure what this means.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If any of y, x, y_hat, y_model is str, then idata is necessary. Yea, This line is unclear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This explanation should be the one on the docstring! It's different from the current one too which seeing this looks wrong
arviz/plots/lmplot.py
Outdated
backend : str, Optional | ||
Select plotting backend {"matplotlib","bokeh"}. Default "matplotlib". | ||
y_kwargs : dict, optional | ||
Passed to plot() in matplotlib and circle() in bokeh |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you make plot and circle hyperlinks using intersphinx? same for all other kwargs.
see https://arviz-devs.github.io/arviz/contributing/developer_guide.html#hyperlinks and #1188 for guidance and examples.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
bef54eb
to
e222717
Compare
arviz/plots/lmplot.py
Outdated
num_pp_samples : int, Optional, Default 50 | ||
kind_pp : {"samples", "hdi"}, Default "samples" | ||
kind_model : {"lines", "hdi"}, Default "lines" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these need an explanation, even if a single sentence.
arviz/plots/lmplot.py
Outdated
Select plotting backend {"matplotlib","bokeh"}. Default "matplotlib". | ||
y_kwargs : dict, optional | ||
Passed to :meth:`mpl:matplotlib.axes.Axes.plot` in matplotlib | ||
and :meth:`bokeh:bokeh.plotting.figure.Figure.circle` in bokeh |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
according to sphinxinv, the objects in bokeh docs containing circle are these:
:py:class:`bokeh.models.glyphs.Circle`
:py:attribute:`bokeh.models.glyphs.Circle.x`
:py:attribute:`bokeh.models.glyphs.Circle.y`
:py:function:`bokeh.models.markers.Circle`
:py:function:`bokeh.models.markers.CircleX`
:py:function:`bokeh.models.markers.CircleY`
:py:method:`bokeh.plotting.Figure.circle`
:py:method:`bokeh.plotting.Figure.circle_x`
:py:method:`bokeh.plotting.Figure.circle_y`
:py:method:`bokeh.plotting.GMap.circle`
:py:method:`bokeh.plotting.GMap.circle_x`
:std:label:`bokeh.models.glyphs.circle`
:std:label:`bokeh.models.markers.circle`
:std:label:`bokeh.models.markers.circledot`
:std:label:`bokeh.models.markers.circlex`
:std:label:`bokeh.models.markers.circley`
the one you used is not an option, maybe it was in an old version?
and :meth:`bokeh:bokeh.plotting.figure.Figure.line` in bokeh | ||
y_model_fill_kwargs : dict, optional | ||
Passed to az.plot_hdi() | ||
backend_kwargs : dict, optional |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this also needs some explanation as well as the intersphinx references.
arviz/plots/lmplot.py
Outdated
Passed to :meth:`mpl:matplotlib.axes.Axes.plot` in matplotlib | ||
and :meth:`bokeh:bokeh.plotting.Figure.circle` in bokeh | ||
y_hat_fill_kwargs : dict, optional | ||
Passed to az.plot_hdi() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Passed to az.plot_hdi() | |
Passed to {func}`~arviz.plot_hdi` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know how I missed this one!
arviz/plots/lmplot.py
Outdated
These are kwargs specific to the backend being used. For additional documentation | ||
check the plotting method of the backend. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the links to the methods where these are passed are still missing, you can take them from #1728 to avoid having to track them down in the source code.
f715362
to
bc9c18e
Compare
Description
This PR and #1747 summarizes my work done in Google Summer of Code 2021. I added 2 plots,
plot_lm
andplot_ts
. The first one is covered in this PR and the next one in #1747. The project workflow was like this:Started with a high-level design: Discussed the user API function with the mentors and other community members. Discussion includes input variable names, accepted datatypes, accepted values and other input details. Also showed a sample output visualization. Once it was approved, I moved on to opening this PR. This Github Gist shows the design and some discussions related to it. The design process took an initial week.
Submitted a prototype: This PR was earlier a prototype. This step is basically implementing the design decision made in the previous step. In addition to the user API, backend functions are also added. As ArviZ uses 2 backends, artist functions for both of them were added. This step took another week.
Improved it according to reviews: In this step, mentors review your code to provide feedback and improvement tips. Learned the best code practices in this step. Ideally, improving never ends. It is important to maintain the added code after it is merged to keep it bug-free. I aim to provide support after GSoC. This step was a bit lengthy and complex and thus took 2 weeks.
Tested the functionality: Added unit tests using
pytest
. Aimed and achieved to cover all of the added functionality under tests. Similar to step 2, added tests for both of the backends. Also, solved some simple to complex bugs that arise while testing. This step took another week.Added examples: Examples are added to the docstring as well as to the examples folder. Please check out the files changed tab to know more about this step. This step was quick, took only half a week.
Added documentation: If you want to know how to use
plot_lm
, checkout out this blog. However, if you want to go on a low level to know the working in detail, I would suggest taking a look over the docstring in ArviZ docs and probably follow the comments sequentially. Another week was consumed in this step.Output :

Checklist