Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

map_over_datasets: skip empty nodes #10042

Open
wants to merge 15 commits into
base: main
Choose a base branch
from

Conversation

mathause
Copy link
Collaborator

@mathause mathause commented Feb 10, 2025


  • misses tests and docs but I'd like to get some feedback first
  • needs some add. logic to only check the output on non-empty nodes and to ensure multi-output functions are correct
  • no good way for a proper deprecation without a keyword

@mathause mathause marked this pull request as draft February 10, 2025 17:37
@Illviljan
Copy link
Contributor

A interpolation use case that doesn't crash with this PR:

import numpy as np

import xarray as xr

number_of_files = 700
number_of_groups = 5
number_of_variables = 10

datasets = {}
for f in range(number_of_files):
    for g in range(number_of_groups):
        # Create random data
        time = np.linspace(0, 50 + f, 1 + 1000 * g)
        y = f * time + g

        # Create dataset:
        ds = xr.Dataset(
            data_vars={
                f"temperature_{g}{i}": ("time", y)
                for i in range(number_of_variables // number_of_groups)
            },
            coords={"time": ("time", time)},
        ).chunk()

        # Prepare for xr.DataTree:
        name = f"file_{f}/group_{g}"
        datasets[name] = ds
dt = xr.DataTree.from_dict(datasets)

# %% Interpolate to same time coordinate
def ds_interp(ds, *args, **kwargs):
    return ds.interp(*args, **kwargs)


new_time = np.linspace(0, 100, 50)
dt_interp = dt.map_over_datasets(
    ds_interp, kwargs=dict(time=new_time, assume_sorted=True)
)

@mathause
Copy link
Collaborator Author

Thanks for the example. This PR would also close #10013. This would be a huge plus for me. Not being able to subtract a ds from a datatree makes it extremely cumbersome. However, this implies that the binary ops are implemented using map_over_datasets and means there is a considerable behavior change.

@mathause
Copy link
Collaborator Author

@TomNicholas do you see any chance this PR might get merged (after adding tests etc. obviously)? Are there discussions beside #9693 that I am missing?

@TomNicholas TomNicholas added the topic-DataTree Related to the implementation of a DataTree class label Feb 27, 2025
@TomNicholas
Copy link
Member

Hey @mathause - sorry for forgetting about this - I've been busy.

I think something like this should get merged, but there are various small and fairly arbitrary choices to quibble over. They are basically all already mentioned in #9693 though.

this implies that the binary ops are implemented using map_over_datasets and means there is a considerable behavior change.

I don't understand this statement though - aren't binary ops already implemented using map_over_datasets?

return map_over_datasets(ds_binop, self, other)

We're changing the behaviour, but changing it to be closer to the old datatree, which is what a lot of users expect anyway.

@mathause mathause closed this Mar 6, 2025
@mathause mathause reopened this Mar 6, 2025
@mathause mathause marked this pull request as ready for review March 7, 2025 04:42
@mathause
Copy link
Collaborator Author

mathause commented Mar 7, 2025

I think this is ready for review


Hey @mathause - sorry for forgetting about this - I've been busy.

No worries! Thanks for considering this PR!

I think something like this should get merged, but there are various small and fairly arbitrary choices to quibble over. They are basically all already mentioned in #9693 though.

The one unclear choice from #9693 was the comment by @shoyer #9693 (comment):

I'm not sure whether or not to call the mapped over function for nodes that only define coordinates. Certainly I would not blindly copy coordinates from otherwise empty nodes onto the result, because those coordinates may no longer be relevant on the result.

I currently use DataTree.has_data, this includes nodes that only have coords (although nodes which inherit coords are excluded (I think)). I don't see a way to be clever about these nodes.

this implies that the binary ops are implemented using map_over_datasets and means there is a considerable behavior change.

I don't understand this statement though - aren't binary ops already implemented using map_over_datasets?

Yes sorry that was not clear. I just wanted to say that binary ops are also affected.

@mathause
Copy link
Collaborator Author

gentle ping @TomNicholas

(apologies for bothering you again - I am unfortunately currently blocked by this in another project. Or is there someone else who could potentially review this?)

@TomNicholas
Copy link
Member

no worries, replied here #9693 (comment)

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
topic-DataTree Related to the implementation of a DataTree class
Projects
None yet
Development

Successfully merging this pull request may close these issues.

datatree gets dis-aligned in binary op map_over_datasets throws error on nodes without datasets
3 participants