Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

add extract_dataset function #1725

Merged
merged 6 commits into from
Jan 19, 2022
Merged

add extract_dataset function #1725

merged 6 commits into from
Jan 19, 2022

Conversation

OriolAbril
Copy link
Member

@OriolAbril OriolAbril commented Jun 12, 2021

Description

Fixes #1469. I think this will cover most casesin the issue and several
more common enough to be relevant like random subsetting for plotting for example.

I probably won't be able to work on this and finish it in the immediate future, if we want this to
be included in the next release someone else should take over, add tests and some examples in the
docstring. If interested comment and go for it, feel free to open a pr to this branch, create a
new PR that builds on top of this work, or work directly on this (if you have permissions to do so).

Checklist

  • Does the PR follow official PR format?
  • Is the new feature properly documented with an example?
  • Does the PR include new or updated tests to cover the new feature (using pytest fixture pattern)?
  • Is the code style correct (follows pylint and black guidelines)?
  • Is the new feature listed in the New features
    section of the changelog?

@ahartikainen
Copy link
Contributor

This function should be able to combine data-arrays from different groups if needed?

@OriolAbril
Copy link
Member Author

that could be cool too, how would you indicate that? instead of group and var_names arguments have a group_vars one that is a dict of group names as keys lists of variables names as values?

@ahartikainen
Copy link
Contributor

That sounds reasonable option to go with.

arviz/utils.py Outdated
if rng is not False:
if rng is True:
rng = np.random.default_rng()
elif isinstance(rng, int):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should use isinstance(rng, numbers.Integral)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

modified this to a try except, turns out default_rng also takes sequence of integers plus maybe someone uses a subclassed generator. Catched two errors to hopefully provide more informative errors.

arviz/utils.py Outdated
rng = np.random.default_rng(rng)
if not isinstance(rng, np.random.Generator):
raise ValueError("Could not interpret rng as np.random.Generator")
random_subset = rng.permutation(np.arange(len(data["__sample__"])))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this still be sorted? (Imagine each chain is stuck --> this problem can still be seen even if rng is used)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(see comment about stack routine --> if true, then this comment is not relevant)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The results are only sorted if

  • all samples are returned, stacked or not. I assumed in this case, users are keeping all the samples to use all of them, not to cut them right after that. But we can change the default for the stacked case though.
  • rng=False explicitly (not the default)

In all other cases, the result is in this random_subset order. For example:

In [2]: az.extract_dataset(idata, combined=True, var_names="home", num_samples=4)
Out[2]: 
<xarray.Dataset>
Dimensions:  (sample: 4)
Coordinates:
  * sample   (sample) MultiIndex
  - chain    (sample) int64 3 2 1 0
  - draw     (sample) int64 36 390 375 496
Data variables:
    home     (sample) float64 0.1911 0.1604 0.1791 0.1109
Attributes:
    created_at:                 2019-07-12T20:31:53.545143
    inference_library:          pymc3
    inference_library_version:  3.7

arviz/utils.py Outdated
random_subset = rng.permutation(np.arange(len(data["__sample__"])))
data = data.isel(__sample__=random_subset)
if num_samples is not None:
data = data.isel(__sample__=slice(None, num_samples))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we do a bit more complex thing here?

Instead of returning n first samples, we could return n/nchain samples from each chain?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh wait, due to our stack routine it is already doing it?

@OriolAbril
Copy link
Member Author

will review and update this week

@OriolAbril
Copy link
Member Author

Should be ready to merge now, but still happy to have reviews!

Copy link
Contributor

@aloctavodia aloctavodia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Remember to update the changelog

@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@OriolAbril OriolAbril merged commit a67d81c into main Jan 19, 2022
@OriolAbril OriolAbril deleted the stack_util branch January 19, 2022 03:25
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add helper function to easily stack chain and draws dimensions
3 participants