Subset arrays #411


Merged: 24 commits, May 6, 2025

Conversation

eodole
Collaborator

@eodole eodole commented Apr 14, 2025

Addresses Issue #278

Implemented the Take and subsample methods. However, as discussed with Lars, the requested squeeze functionality is essentially just the inverse of expand_dims.
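The point about squeeze being the inverse of expand_dims can be seen directly in numpy (a quick illustration, not code from this PR):

```python
import numpy as np

# expand_dims inserts a length-1 axis; squeeze removes it again,
# restoring the original shape.
x = np.zeros((2, 3))
expanded = np.expand_dims(x, axis=1)     # shape (2, 1, 3)
restored = np.squeeze(expanded, axis=1)  # back to shape (2, 3)
```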

@eodole eodole requested a review from LarsKue April 14, 2025 16:47
Contributor

@LarsKue LarsKue left a comment


Thank you for the PR! The core of these changes already looks good, but there are some things I would like to see changed. Please also change the base branch of the PR to dev. See here on how to do that. EDIT: I could change it myself 🙂

@@ -39,3 +39,6 @@ docs/

# MacOS
.DS_Store

# Rproj
.Rproj.user
Contributor


I am unfamiliar with R. What is this directory used for, and should all other users have it ignored too? Otherwise, please put this in your local .git/info/exclude instead.

Contributor


According to @stefanradev93, this should be .Rproj.

)
assert not np.all(
np.diff(output, axis=i) > 0
), f"is ordered along axis which is not meant to be ordered: {i}."
Contributor


I'm not sure why this is being reordered now. Are you running ruff version 0.11.2?

Collaborator Author


Apparently I was running ruff 0.8.1, but I will update it.

def __init__(self):
super().__init__()

def forward(self, data: np.ndarray, sample_size: int, axis=-1):
Contributor


IMO, the sample_size should be part of the constructor for this transform. It seems like a bit of a hassle to have to pass this argument in the forward call of the Adapter, unless you have a specific use case in mind?
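A minimal sketch of what moving sample_size (and axis) into the constructor could look like. Class name, defaults, and method signature are assumptions for illustration, not the merged API:

```python
import numpy as np


class RandomSubsample:
    """Randomly subsample an array along a given axis.

    Sketch only: sample_size and axis live in the constructor, so the
    adapter's forward call needs no extra arguments.
    """

    def __init__(self, sample_size: int, axis: int = -1):
        self.sample_size = sample_size
        self.axis = axis

    def forward(self, data: np.ndarray) -> np.ndarray:
        # Draw indices without replacement along the chosen axis.
        n = data.shape[self.axis]
        indices = np.random.choice(n, size=self.sample_size, replace=False)
        return np.take(data, indices, axis=self.axis)
```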

Contributor


I would also allow it to be a float in [0, 1] specifying a proportion of the sample to subsample.

Contributor


I second this, but it raises the question of whether we should floor or ceil the resulting value. I think ceiling would be better.

Collaborator Author


What case do you imagine where ceiling beats floor? For most applications, I assume users would want a slightly smaller rather than larger sample size.
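One way to see the floor-vs-ceil trade-off discussed above: ceiling guarantees at least one element is kept for any positive proportion, whereas floor can round a small proportion down to zero. A hypothetical helper (not the merged implementation):

```python
import math


def resolve_sample_size(sample_size, total: int) -> int:
    """Resolve an int or float sample size to a concrete count.

    Floats in (0, 1] are treated as a proportion of `total`.
    Hypothetical helper for illustration; ceiling is used as one
    possible choice in the floor/ceil question.
    """
    if isinstance(sample_size, float):
        if not 0.0 < sample_size <= 1.0:
            raise ValueError("Proportions must lie in (0, 1].")
        # ceil keeps at least one element even when
        # total * proportion < 1 (floor would give 0 here)
        return math.ceil(sample_size * total)
    return sample_size
```

For example, a proportion of 0.1 on a sample of 5 yields 1 with ceiling but 0 with floor, which would silently drop the dataset.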

@LarsKue LarsKue changed the base branch from main to dev April 15, 2025 15:43
@LarsKue LarsKue added feature New feature or request user interface Changes to the user interface and improvements in usability good first issue Good for first-time contributors labels Apr 15, 2025
@LarsKue LarsKue moved this from Future to In Progress in bayesflow development Apr 15, 2025
Collaborator Author

@eodole eodole left a comment


I'm honestly not sure that this should be a map transform as written. One consequence of the current implementation is that all keys specified by this transform will receive random subsamples of the same size along the same axis. I think datasets a user wants to subsample should be specified individually, so that axis and sample size can be set per dataset. If that's the case, wouldn't it be better to force the map transform to reject a sequence of keys and accept only a single key?

@eodole
Collaborator Author

eodole commented Apr 22, 2025

You also asked me to rename subsample_array to random_subsample. My question is: should all associated files also be renamed?

@stefanradev93 stefanradev93 deleted the branch bayesflow-org:dev April 22, 2025 14:37
@github-project-automation github-project-automation bot moved this from In Progress to Done in bayesflow development Apr 22, 2025
@LarsKue
Contributor

LarsKue commented Apr 22, 2025

This was accidentally closed. We will investigate how to restore the branch and reopen PRs.

@LarsKue LarsKue reopened this Apr 22, 2025
@LarsKue
Contributor

LarsKue commented Apr 22, 2025

Yes, please rename all associated files so the structure of file_name.py::class_or_function_name is consistent.

to force the map transform to reject a sequence of keys

In this case, I think we would want to have a regular Transform. It would also be fine to implement it as an ElementwiseTransform and wrap it in a MapTransform internally, but then the dispatch method on the adapter, i.e., adapter.random_subsample should raise an error if a Sequence of keys is passed.
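The dispatch guard described above could look roughly like this. The Adapter stub and method signature are assumptions for illustration, not bayesflow's actual code:

```python
class Adapter:
    """Stub adapter illustrating the single-key guard (names assumed)."""

    def __init__(self):
        self.transforms = []

    def random_subsample(self, key, sample_size, axis: int = -1):
        # Reject sequences of keys: each dataset must be subsampled
        # individually so axis and sample size stay per-dataset.
        if not isinstance(key, str):
            raise TypeError(
                "random_subsample expects a single key, not a sequence of keys; "
                "apply it once per dataset you want to subsample."
            )
        self.transforms.append(("random_subsample", key, sample_size, axis))
        return self
```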

@LarsKue LarsKue self-requested a review May 2, 2025 17:19
Contributor

@LarsKue LarsKue left a comment


Thank you for addressing my previous comments. Before we can merge this into dev, it looks like you will need to resolve merge-conflicts by merging dev into this first. If you need help with that, let me know.





Contributor


These empty lines would be removed by the formatter, which should automatically run on pre-commit.

Collaborator Author


I made a new environment based on the contribution.md, and now the linter runs automatically. On my end, the most recent commit passed all checks, so I'm not sure why it's failing on the pull request.

Contributor


Try running pre-commit run --all-files manually once.


codecov bot commented May 6, 2025

Codecov Report

Attention: Patch coverage is 92.59259% with 4 lines in your changes missing coverage. Please review.

Files with missing lines                            Patch %   Missing
bayesflow/adapters/transforms/random_subsample.py   88.00%    3 ⚠️
bayesflow/adapters/adapter.py                       90.00%    1 ⚠️

Files with missing lines                            Coverage  Δ
bayesflow/adapters/transforms/__init__.py           100.00%   <100.00%> (ø)
bayesflow/adapters/transforms/take.py               100.00%   <100.00%> (ø)
bayesflow/adapters/adapter.py                       84.01%    <90.00%> (-0.68%) ⬇️
bayesflow/adapters/transforms/random_subsample.py   88.00%    <88.00%> (ø)

... and 20 files with indirect coverage changes

@eodole
Collaborator Author

eodole commented May 6, 2025

I have no idea why the linter failed. I have it in the pre-commit hook, and I merged dev into this branch. I'm not really sure why it doesn't pass.

@eodole eodole requested a review from LarsKue May 6, 2025 12:29
@LarsKue
Contributor

LarsKue commented May 6, 2025

@eodole I did some clean-up to make the linter/formatter pass and I adjusted the transforms, docs, and tests to what I think your intention was. Please double-check.

The tests are still failing, however, due to some incorrect shape broadcasting. Can you take care of fixing this?

Contributor

@LarsKue LarsKue left a comment


See comment above.

@LarsKue
Contributor

LarsKue commented May 6, 2025

I just realized that it's simply an issue with the combination of the tests and the new transforms: the new transforms are non-deterministic and non-invertible, but the tests assume invertibility and determinism at some points. We are skipping these checks for the new transforms for now.
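A sketch of the kind of guard the test suite might use, assuming hypothetical transform names and a helper function (not the actual bayesflow test code):

```python
import numpy as np

# Transforms that are random and therefore have no exact inverse
# (names are assumptions based on this PR).
NON_INVERTIBLE_TRANSFORMS = {"RandomSubsample", "Take"}


def check_cycle_consistency(name: str, forward, inverse, data: np.ndarray) -> bool:
    """Run the forward/inverse round-trip check, skipping transforms
    that are non-deterministic or non-invertible.

    Returns True if the check ran, False if it was skipped.
    """
    if name in NON_INVERTIBLE_TRANSFORMS:
        return False  # skip: no exact inverse exists
    assert np.allclose(inverse(forward(data)), data)
    return True
```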

I also implemented the serialization pipeline for the transforms, so all tests should pass now. Thanks for the PR!

@LarsKue LarsKue self-requested a review May 6, 2025 17:30
@LarsKue LarsKue merged commit b4d0a72 into bayesflow-org:dev May 6, 2025
9 checks passed
stefanradev93 added a commit that referenced this pull request Jun 17, 2025
* Subset arrays (#411)

* made initial backend functions for adapter subsetting, need to still make the squeeze function and link it to the front end

* added subsample functionality, to do would be adding them to testing procedures

* made the take function and ran the linter

* changed name of subsampling function

* changed documentation, to be consistent with external notation, rather than internal shorthand

* small formation change to documentation

* changed subsample to have sample size and axis in the constructor

* moved transforms in the adapter.py so they're in alphabetical order like the other transforms

* changed random_subsample to maptransform rather than filter transform

* updated documentation with new naming convention

* added arguments of take to the constructor

* added feature to specify a percentage of the data to subsample rather than only integer input

* changed subsample in adapter.py to allow float as an input for the sample size

* renamed subsample_array and associated classes/functions to RandomSubsample and random_subsample respectively

* included TypeError to force users to only subsample one dataset at a time

* ran linter

* rerun formatter

* clean up random subsample transform and docs

* clean up take transform and docs

* nitpick clean-up

* skip shape check for subsampled adapter transform inverse

* fix serialization of new transforms

* skip randomly subsampled key in serialization consistency check

---------

Co-authored-by: LarsKue <lars@kuehmichel.de>

* [no ci] docs: start of user guide - draft intro, gen models

* [no ci] add draft for data processing section

* [no ci] user guide: add stub on summary/inference networks

* [no ci] user guide: add stub on additional topics

* [no ci] add early stage disclaimer to user guide

* pin dependencies in docs, fixes snowballstemmer error

* fix: correct check for "no accepted samples" in rejection_sample

Closes #466

* Stabilize MultivariateNormalScore by constraining initialization in PositiveDefinite link (#469)

* Refactor fill_triangular_matrix

* stable positive definite link, fix for #468

* Minor changes to docstring

* Remove self.built=True that prevented registering layer norm in build()

* np -> keras.ops

* Augmentation (#470)

* Remove old rounds data set, add documentation, and augmentation options to data sets

* Enable augmentation to parts of the data or the whole data

* Improve doc

* Enable augmentations in workflow

* Fix silly type check and improve readability of for loop

* Bring back num_batches

* Fixed log det jac computation of standardize transform

y = (x - mu) / sigma
log p(y) = log p(x) - log(sigma)

* Fix fill_triangular_matrix

The two lines were switched, leading to performance degradation.

* Deal with inference_network.log_prob to return dict (as PointInferenceNetwork does)

* Add diffusion model implementation (#408)

This commit contains the following changes (see PR #408 for discussions)

- DiffusionModel following the formalism in Kingma et. al (2023) [1]
- Stochastic sampler to solve SDEs
- Tests for the diffusion model

[1] https://arxiv.org/abs/2303.00848

---------

Co-authored-by: arrjon <jonas.arruda@uni-bonn.de>
Co-authored-by: Jonas Arruda <69197639+arrjon@users.noreply.github.com>
Co-authored-by: LarsKue <lars@kuehmichel.de>

* [no ci] networks docstrings: summary/inference network indicator (#462)

- From the table in the `bayesflow.networks` module overview, one cannot
  tell which network belongs to which group. This commit adds short
  labels to indicate inference networks (IN) and summary networks (SN)

* `ModelComparisonSimulator`: handle different outputs from individual simulators (#452)

Adds option to drop, fill or error when different keys are encountered in the outputs of different simulators. Fixes #441.

---------

Co-authored-by: Valentin Pratz <git@valentinpratz.de>

* Add classes and transforms to simplify multimodal training (#473)

* Add classes and transforms to simplify multimodal training

- Add class `MultimodalSummaryNetwork` to combine multiple summary
  networks, each for one modality.
- Add transforms `Group` and `Ungroup`, to gather the multimodal inputs
  in one variable (usually "summary_variables")
- Add tests for new behavior

* [no ci] add tutorial notebook for multimodal data

* [no ci] add missing training argument

* rename MultimodalSummaryNetwork to FusionNetwork

* [no ci] clarify that the network implements late fusion

* allow dispatch of summary/inference network from type

* add tests for find_network

* Add squeeze transform

Very basic transform, just the inverse of expand_dims

* [no ci] fix examples in ExpandDims docstring

* squeeze: adapt example, add comment for changing batch dims

* Permit Python version 3.12 (#474)

Allow Python version 3.12 after successful CI run: https://github.com/bayesflow-org/bayesflow/actions/runs/14988542031

* Change order in readme and reference new book [skip ci]

* make docs optional dependencies compatible with python 3.10

* Add a custom `Sequential` network to avoid issues with building and serialization in keras (#493)

* add custom sequential to fix #491

* revert using Sequential in classifier_two_sample_test.py

* Add docstring to custom Sequential

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* fix copilot docstring

* remove mlp override methods

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Add Nnpe adapter class (#488)

* Add NNPE adapter

* Add NNPE adapter tests

* Only apply NNPE during training

* Integrate stage differentiation into tests

* Improve test coverage

* Fix inverse and add to tests

* Adjust class name and add docstring to forward method

* Enable compatibility with #486 by adjusting scales automatically

* Add dimensionwise noise application

* Update exception handling

* Fix tests

* Align diffusion model with other inference networks and remove deprecation warnings (#489)

* Align dm implementation with other networks

* Remove deprecation warning for using subnet_kwargs

* Fix tests

* Remove redundant training arg in get_alpha_sigma and some redundant comments

* Fix configs creation - do not get base config due to fixed call of super().__init__()

* Remove redundant training arg from tests

* Fix dispatch tests for dms

* Improve docs and mark option for x prediction in literal

* Fix start/stop time

* minor cleanup of refactory

---------

Co-authored-by: Valentin Pratz <git@valentinpratz.de>

* add replace nan adapter (#459)

* add replace nan adapter

* improved naming

* _mask as additional key

* update test

* improve

* fix serializable

* changed name to return_mask

* add mask naming

* [no ci] docs: add basic likelihood estimation example

Fixes #476. This is the barebones version showing the technical steps to
do likelihood estimation. Adding more background and motivation would be
nice.

* make metrics serializable

It seems that metrics do not store their state, I'm not sure yet if this
is intended behavior.

* Remove layer norm; add epsilon to std dev for stability of pos def link

this breaks serialization of point estimation with MultivariateNormalScore

* add end-to-end test for fusion network

* fix: ensure that build is called in FusionNetwork

* Correctly track train / validation losses (#485)

* correctly track train / validation losses

* remove mmd from two moons test

* reenable metrics in continuous approximator, add trackers

* readd custom metrics to two_moons test

* take batch size into account when aggregating metrics

* Add docs to backend approximator interfaces

* Add small doc improvements

* Fix typehints to docs.

---------

Co-authored-by: Valentin Pratz <git@valentinpratz.de>
Co-authored-by: stefanradev93 <stefan.radev93@gmail.com>

* Add shuffle parameter to datasets

Adds the option to disable data shuffling

---------

Co-authored-by: Lars <lars@kuehmichel.de>
Co-authored-by: Valentin Pratz <git@valentinpratz.de>

* fix: correct vjp/jvp calls in FreeFormFlow

The signature changed, making it necessary to set return_output=True

* test: add basic compute_metrics test for inference networks

* [no ci] extend point approximator tests

- remove skip for MVN
- add test for log-prob

* [no ci] skip unstable MVN sample test again

* update README with more specific install instructions

* fix FreeFormFlow: remove superfluous index form signature change

* [no ci] FreeFormFlow MLP defaults: set dropout to 0

* Better pairplots (#505)

* Hacky fix for pairplots

* Ensure that target sits in front of other elements

* Ensure consistent spacing between plot and legends + cleanup

* Update docs

* Fix the propagation of `legend_fontsize`

* Minor fix to comply with code style

* [no ci] Formatting: escaped space only in raw strings

* [no ci] fix typo in error message, model comparison approximator

* [no ci] fix: size_of could not handle basic int/float

Passing in basic types would lead to infinite recursion. Checks for
other types than int and float might be necessary as well.

* add tests for model comparison approximator

* Generalize sample shape to arbitrary N-D arrays

* [WIP] Move standardization into approximators and make adapter stateless. (#486)

* Add standardization to continuous approximator and test

* Fix init bugs, adapt tnotebooks

* Add training flag to build_from_data

* Fix inference conditions check

* Fix tests

* Remove unnecessary init calls

* Add deprecation warning

* Refactor compute metrics and add standardization to model comp

* Fix standardization in cont approx

* Fix sample keys -> condition keys

* amazing keras fix

* moving_mean and moving_std still not loading [WIP]

* remove hacky approximator serialization test

* fix building of models in tests

* Fix standardization

* Add standardizatrion to model comp and let it use inheritance

* make assert_models/layers_equal more thorough

* [no ci] use map_shape_structure to convert shapes to arrays

This automatically takes care of nested structures.

* Extend Standardization to support nested inputs (#501)

* extend Standardization to nested inputs

By using `keras.tree.flatten` und `keras.tree.pack_sequence_as`, we can
support arbitrary nested structures. A `flatten_shape` function is
introduced, analogous to `map_shape_structure`, for use in the build
function.

* keep tree utils in submodule

* Streamline call

* Fix typehint

---------

Co-authored-by: stefanradev93 <stefan.radev93@gmail.com>

* Update moments before transform and update test

* Update notebooks

* Refactor and simplify due to standardize

* Add comment for fetching the dict's first item, deprecate logits arg and fix typehint

* add missing import in test

* Refactor preparation of data for networks and new point_appr.log_prob

* ContinuousApproximator._prepare_data unifies all preparation in
  sample, log_prob and estimate for both ContinuousApproximator and
  PointApproximator
* PointApproximator now overrides log_prob

* Add class attributes to inform proper standardization

* Implement stable moving mean and std

* Adapt and fix tests

* minor adaptations to moving average (update time, init)

We should put the update before the standardization, to use the maximum
amount of information available. We can then also initialize the moving
M^2 with zero, as it will be filled immediately.

The special case of M^2 = 0 is not problematic, as no variance
automatically indicates that all entries are equal, and we can set
them to zero  (see my comment).

I added another test case to cover that case, and added a test for the
standard deviation to the existing test.

* increase tolerance of allclose tests

* [no ci] set trainable to False explicitly in ModelComparisonApproximator

* point estimate of covariance compatible with standardization

* properly set values to zero if std is zero

Cases for inf and -inf were missing

* fix sample post-processing in point approximator

* activate tests for multivariate normal score

* [no ci] undo prev commit: MVN test still not stable, was hidden by std of 0

* specify explicit build functions for approximators

* set std for untrained standardization layer to one

An untrained layer thereby does not modify the input.

* [no ci] reformulate zero std case

* approximator builds: add guards against building networks twice

* [no ci] add comparison with loaded approx to workflow test

* Cleanup and address building standardization layers  when None specified

* Cleanup and address building standardization layers when None specified 2

* Add default case for std transform and add transformation to doc.

* adapt handling of the special case M^2=0

* [no ci] minor fix in concatenate_valid_shapes

* [no ci] extend test suite for approximators

* fixes for standardize=None case

* skip unstable MVN score case

* Better transformation types

* Add test for both_sides_scale inverse standardization

* Add test for left_side_scale inverse standardization

* Remove flaky test failing due to sampling error

* Fix input dtypes in inverse standardization transformation_type tests

* Use concatenate_valid in _sample

* Replace PositiveDefinite link with CholeskyFactor

This finally makes the MVN score sampling test stable for the jax backend,
for which the keras.ops.cholesky operation is numerically unstable.

The score's sample method avoids calling keras.ops.cholesky to resolve
the issue. Instead the estimation head returns the Cholesky factor
directly rather than the covariance matrix (as it used to be).

* Reintroduce test sampling with MVN score

* Address TODOs and adapt docstrings and workflow

* Adapt notebooks

* Fix in model comparison

* Update readme and add point estimation nb

---------

Co-authored-by: LarsKue <lars@kuehmichel.de>
Co-authored-by: Valentin Pratz <git@valentinpratz.de>
Co-authored-by: Valentin Pratz <112951103+vpratz@users.noreply.github.com>
Co-authored-by: han-ol <g@hans.olischlaeger.com>
Co-authored-by: Hans Olischläger <106988117+han-ol@users.noreply.github.com>

* Replace deprecation with FutureWarning

* Adjust filename for LV

* Fix types for subnets

* [no ci] minor fixes to RandomSubsample transform

* [no ci] remove subnet deprecation in cont-time CM

* Remove empty file [no ci]

* Revert layer type for coupling flow [skip ci]

* remove failing import due to removed find_noise_schedule.py [no ci]

* Add utility function for batched simulations (#511)

The implementation is a simple wrapper leveraging the batching
capabilities of `rejection_sample`.

* Restore PositiveDefinite link with deprecation warning

* skip cycle consistency test for diffusion models

- the test is unstable for untrained diffusion models, as the networks
  output is not sufficiently smooth for the step size we use
- remove the diffusion_model marker

* Implement changes to NNPE adapter for #510 (#514)

* Move docstring to comment

* Always cast to _resolve_scale

* Fix typo

* [no ci] remove unnecessary serializable decorator on rmse

* fix type hint in squeeze [no ci]

* reintroduce comment in jax approximator [no ci]

* remove unnecessary getattr calls [no ci]

* Rename local variable transformation_type

* fix error type in diffusion model [no ci]

* remove non-functional per_training_step from plots.loss

* Update doc [skip ci]

* rename approximator.summaries to summarize with deprecation

* address remaining comments

---------

Co-authored-by: Leona Odole <88601208+eodole@users.noreply.github.com>
Co-authored-by: LarsKue <lars@kuehmichel.de>
Co-authored-by: Valentin Pratz <git@valentinpratz.de>
Co-authored-by: Hans Olischläger <106988117+han-ol@users.noreply.github.com>
Co-authored-by: han-ol <g@hans.olischlaeger.com>
Co-authored-by: Valentin Pratz <112951103+vpratz@users.noreply.github.com>
Co-authored-by: arrjon <jonas.arruda@uni-bonn.de>
Co-authored-by: Jonas Arruda <69197639+arrjon@users.noreply.github.com>
Co-authored-by: Simon Kucharsky <kucharssim@gmail.com>
Co-authored-by: Daniel Habermann <133031176+daniel-habermann@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Lasse Elsemüller <60779710+elseml@users.noreply.github.com>
Co-authored-by: Jerry Huang <57327805+jerrymhuang@users.noreply.github.com>