map raster #588

Merged: 23 commits merged into scverse:main on Jun 21, 2024

Conversation

@ArneDefauw
Contributor

updated 'apply' function for raster data.

Following discussion #407, I decided to remove the 'combine_c' and 'combine_z' parameters, and with them the support for passing different Callables/fn_kwargs per channel/z-stack. I think it overcomplicates the code; the user can simply write a for loop (see the sketch below) if different Callables/fn_kwargs should be applied to different channels/z-stacks.
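A minimal sketch of such a loop, assuming a hypothetical per_channel dict mapping channel names to (func, fn_kwargs) pairs (per_channel and image are placeholders, not part of this PR):

# Hedged sketch: apply a different Callable/fn_kwargs per channel,
# then reassemble along the channel axis.
import xarray as xr

processed = []
for c, (func, kwargs) in per_channel.items():
    res = map_raster(image.sel(c=[c]), func=func, fn_kwargs=kwargs)
    processed.append(res)
result = xr.concat(processed, dim="c")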

see unit tests for examples:

https://github.com/ArneDefauw/spatialdata/blob/b3e4487cbdfe34ccf38b5a311de1eb2a4b5427c4/tests/core/operations/test_map.py#L31

The signature is now as follows:

def map_raster(
    data: SpatialImage | MultiscaleSpatialImage,
    func: Callable,
    fn_kwargs: Mapping[str, Any] = MappingProxyType({}),
    chunks: str | int | tuple[int, ...] | tuple[tuple[int, ...], ...] | None = None,
    output_chunks: tuple[tuple[int, ...], ...] | None = None,
    depth: str | int | tuple[int, ...] | dict[int, int] | None = None,
    scale_factors: ScaleFactors_t | None = None,  # if specified will return multiscale
    c_coords: int | str | Iterable[int | str] | None = None, 
    **kwargs,
) -> SpatialImage | MultiscaleSpatialImage:
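For illustration, a minimal call in the spirit of the unit tests (_multiply is a placeholder function, sdata_blobs the test fixture):

import numpy as np

def _multiply(arr: np.ndarray, factor: float = 2.0) -> np.ndarray:
    # operates on a dask block and preserves its shape
    return arr * factor

# applied chunkwise over the dask-backed image; returns a SpatialImage
se = map_raster(sdata_blobs["blobs_image"], func=_multiply, fn_kwargs={"factor": 2.0})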

I cannot see how we could remove scale_factors or c_coords from the signature without losing some functionality.

In order for map_raster to be able to alter dimensions (e.g. (c,z,y,x) -> (c,y,x)), I think we would also need to pass dims; see https://github.com/ArneDefauw/spatialdata/blob/b3e4487cbdfe34ccf38b5a311de1eb2a4b5427c4/tests/core/operations/test_map.py#L149

@ArneDefauw mentioned this pull request Jun 17, 2024

codecov bot commented Jun 17, 2024

Codecov Report

Attention: Patch coverage is 96.66667% with 2 lines in your changes missing coverage. Please review.

Project coverage is 92.00%. Comparing base (782efab) to head (5f8db56).
Report is 1 commit behind head on main.

Current head 5f8db56 differs from pull request most recent head 759fd99

Please upload reports for the commit 759fd99 to get more accurate results.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #588      +/-   ##
==========================================
+ Coverage   91.98%   92.00%   +0.01%     
==========================================
  Files          43       44       +1     
  Lines        6560     6613      +53     
==========================================
+ Hits         6034     6084      +50     
- Misses        526      529       +3     
Files Coverage Δ
src/spatialdata/_core/centroids.py 100.00% <100.00%> (ø)
src/spatialdata/_core/operations/map.py 100.00% <100.00%> (ø)
src/spatialdata/_core/operations/vectorize.py 94.44% <100.00%> (+0.06%) ⬆️
src/spatialdata/_io/io_raster.py 95.65% <100.00%> (-0.08%) ⬇️
src/spatialdata/models/_utils.py 90.90% <80.00%> (-0.89%) ⬇️

... and 1 file with indirect coverage changes

@LucaMarconato
Member

LucaMarconato commented Jun 18, 2024

Before the full review, two quick questions.

  1. I see that chunks is used only in this line: arr = da.asarray(arr).rechunk(chunks). I would remove the chunks argument and have the user rechunk the data before calling map_raster(); this way we have one argument fewer and the function is more transparent. Or, what is the argument for delegating this task to map_raster()?

So, there is one test (test_map_raster_output_chunks()) in which the chunks argument is required. I kept this argument and renamed it to input_chunks.

Update: most of the time the input_chunks argument is not required, as the data is already chunked correctly. I have therefore changed the default behavior: if input_chunks is None, the data is used as-is, without a call to rechunk. Applying the function to the full data rather than chunkwise (the old input_chunks=None behavior) can now be achieved by passing chunkwise=False (default is True).
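A small sketch of the two behaviors (img and my_func are hypothetical placeholders):

# default: func is applied chunk-by-chunk over the existing dask blocks
res_blockwise = map_raster(img, func=my_func)

# whole-array: func sees the full data at once
res_global = map_raster(img, func=my_func, chunkwise=False)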

@LucaMarconato
Member

LucaMarconato commented Jun 18, 2024

  1. We need to decide how the output type (in the sense of single-scale vs multi-scale) relates to the input type. We have some options:
    1. the output type is always the same as the input type: if the user passes a single-scale image, a single-scale image is returned; if the user passes a multiscale image, each scale is processed independently and returned as a multiscale object. This means that a user interested in processing only the top scale of a multiscale object would need to pass next(iter(my_multiscale['scale0'].data))
    2. the output type is always a single-scale image by default, computed from the largest scale available.
    3. as in 2, but with an argument scale_factors that computes a multiscale object. I would actually prefer 1 or 2. In fact, we should rather have a helper function to_multiscale() that calls the model and returns a multiscale object; constructing multiscale objects from a single scale is a common task that the user should be able to do easily beyond the use cases of this PR.

What are your thoughts on this? I would go for either 1 or 2 and, as explained above, add a helper function to avoid the need for 3.

Update: I have implemented option 2. Happy to discuss.
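As a sketch of the helper mentioned in option 3 (hypothetical, not part of this PR; it assumes the existing Image2DModel.parse() API):

from spatialdata.models import Image2DModel

def to_multiscale(image, scale_factors):
    # re-parse a single-scale image into a multiscale object via the model
    return Image2DModel.parse(image, scale_factors=scale_factors)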

@LucaMarconato
Member

Another minor comment. I will remove the calls to rechunk that were added to address the problem of data not being saved when the chunks are irregular. Let's fix this in a more general setting inside write(), as also asked by this user: #581.

@LucaMarconato
Member

LucaMarconato commented Jun 18, 2024

I have reviewed the code, fixed the pre-commits, and simplified the APIs while maintaining all the use cases of the tests. @giovp can you also review, please?

Some minor comments:

  • @ArneDefauw I added 3 TODOs to improve the docstrings by adding examples for the parameters that allow many different types, such as chunks. Edit: I actually addressed them, corrected the types, and referred to the dask docs where appropriate.
  • Some minor tests for the exceptions should be added, for instance passing a non-raster type.
  • We should also test against labels, not just images.

@LucaMarconato requested a review from giovp June 18, 2024 14:13
@LucaMarconato marked this pull request as ready for review June 18, 2024 21:59
@LucaMarconato
Member

I addressed all the points that I left open. Some tests are failing in the CI but not locally; I think they are due to the new support for DataArray/DataTree vs SpatialImage/MultiscaleSpatialImage. After that is addressed, if @giovp approves, I'd merge.

@LucaMarconato
Member

I have added the support for DataArray/DataTree. Some tests were failing because the raster models were returning a SpatialImage instead of a DataTree due to a call to to_spatial_image(). I have fixed this, and I fixed other minor things regarding the DataArray/DataTree refactoring. @melonora please be aware of them; if you need them, you may consider cherry-picking the relevant code from my next commit until this PR is merged.

@melonora
Collaborator

  1. We need to decide how the output type (in the sense of single-scale vs multi-scale) relates to the input type. We have some options:

    1. the output type is always the same as the input type: if the user passes a single-scale image, a single-scale image is returned; if the user passes a multiscale image, each scale is processed independently and returned as a multiscale object. This means that a user interested in processing only the top scale of a multiscale object would need to pass next(iter(my_multiscale['scale0'].data))
    2. the output type is always a single-scale image by default, computed from the largest scale available.
    3. as in 2, but with an argument scale_factors that computes a multiscale object. I would actually prefer 1 or 2. In fact, we should rather have a helper function to_multiscale() that calls the model and returns a multiscale object; constructing multiscale objects from a single scale is a common task that the user should be able to do easily beyond the use cases of this PR.

What are your thoughts on this? I would go for either 1 or 2 and, as explained above, add a helper function to avoid the need for 3.

Update: I have implemented option 2. Happy to discuss.

I think option 2 is indeed best, as a user can still pull out a particular scale and use apply should they want a different level. Also, with the output, the user can still create a multiscale again.

@melonora
Collaborator

I think this PR will require a notebook tutorial

@melonora (Collaborator) left a comment

I see no blockers from my side. Great work!

@ArneDefauw
Contributor Author

  1. We need to decide how the output type (in the sense of single-scale vs multi-scale) relates to the input type. We have some options:

    1. the output type is always the same as the input type: if the user passes a single-scale image, a single-scale image is returned; if the user passes a multiscale image, each scale is processed independently and returned as a multiscale object. This means that a user interested in processing only the top scale of a multiscale object would need to pass `next(iter(my_multiscale['scale0'].data))`
    2. the output type is always a single-scale image by default, computed from the largest scale available.
    3. as in 2, but with an argument `scale_factors` that computes a multiscale object. I would actually prefer 1 or 2. In fact, we should rather have a helper function `to_multiscale()` that calls the model and returns a multiscale object; constructing multiscale objects from a single scale is a common task that the user should be able to do easily beyond the use cases of this PR.

What are your thoughts on this? I would go for either 1 or 2 and, as explained above, add a helper function to avoid the need for 3.

Update: I have implemented option 2. Happy to discuss.

Yes, indeed: while I see value in being able to provide scale_factors as a parameter, I think option 2 is the cleanest solution, as converting to multiscale is a bit outside the scope of the function.

@ArneDefauw
Contributor Author

ArneDefauw commented Jun 19, 2024

Regarding this part of the code, I have three comments:

https://github.com/ArneDefauw/spatialdata/blob/71c46654b5d85742d7ee71e3ee630533f276e339/src/spatialdata/_core/operations/map.py#L90

if not chunkwise:
    arr = func(arr, **fn_kwargs)
    if output_chunks is not None:
        arr = arr.rechunk(output_chunks)
else:
    if input_chunks is not None:
        arr = arr.rechunk(input_chunks)
  1. The introduction of the chunkwise parameter is a great choice, as it is more explicit than the chunks parameter I previously used. However, I would now remove the input_chunks parameter, as the user can simply rechunk the SpatialImage or MultiscaleSpatialImage via .chunk(...), e.g.
    se = map_raster(
        sdata_blobs[img_layer].chunk(100),
        func=_multiply,
        fn_kwargs=fn_kwargs,
        chunkwise=False,
        c_coords=None,
        depth=None,
    )

If we want to keep the input_chunks parameter, we should also use it to rechunk when chunkwise is set to False; I do not see why we should not do this.

  2. If we remove input_chunks, we can rename output_chunks to chunks, so that its name matches da.map_blocks and da.map_overlap; this prevents confusion for users familiar with dask who would otherwise pass chunks via kwargs (see the sketch after this list).
  3. I would not rechunk with output_chunks if blockwise is False; I would ignore output_chunks in that case and state as much in the docstring.

I am happy to make these changes if you agree.
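Under proposals 1 and 2 above, a call could then look like this (a hedged sketch; the chunk specs are illustrative, not from the PR):

se = map_raster(
    sdata_blobs[img_layer].chunk(100),  # the user rechunks the input explicitly
    func=_multiply,
    fn_kwargs=fn_kwargs,
    chunks=((1,), (256, 256), (256, 256)),  # output chunks, same meaning as in da.map_blocks
)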

@ArneDefauw
Contributor Author

I've added a unit test for altering dimensions, (c,z,y,x) -> (z,y,x), as it was still marked as a TODO.

@LucaMarconato
Member

Thanks for the comments; I agree with all the changes. If you could make them, that would be great, thanks.

@ArneDefauw
Contributor Author

Thanks for the comments; I agree with all the changes. If you could make them, that would be great, thanks.

Ok, will do!

@ArneDefauw
Contributor Author

Thanks for the comments; I agree with all the changes. If you could make them, that would be great, thanks.

Ok, will do!

Done. I also renamed the parameter chunkwise to blockwise, as da.map_overlap and da.map_blocks are just special cases of da.blockwise.
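For reference, the dask calls the name alludes to (a minimal illustration, not code from this PR):

import dask.array as da
import numpy as np

arr = da.ones((4, 4), chunks=2)
da.map_blocks(np.square, arr)                             # per block, no halo
da.map_overlap(np.square, arr, depth=1, boundary="none")  # per block, with a halo of 1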

@giovp (Member) left a comment

looks really great @ArneDefauw, some general questions:
this seems a much more slimmed-down version of #407; in particular, it seems that no calls to dask-image are ever actually made, and that it just targets the use case of applying user-provided functions to raster data.

  • Does this mean that there will be a separate API/method for calls to dask.image?
  • For things like region_props, where both a label and an image layer are needed, would there also be a separate API/method, or will it be added to this function in the future?

Considering the answers to these questions, I wonder if the function should be renamed?

These are all genuine questions; I think this is definitely ready to get in as is, and it's great. I just wanted to raise once more the roadmap/use cases of similar, yet not identical, functionalities and whether they should/will be added to spatialdata.

f"Depth {depth} is provided for {len(depth)} dimensions. "
f"Please (only) provide depth for {arr.ndim} dimensions."
)
kwargs["depth"] = coerce_depth(arr.ndim, depth)
Member left a comment

while coerce_depth is a public function, it is not documented. I wonder if it would be better to be explicit about what to pass via depth (and raise an error, as above) instead. Also, again, kwargs is used here to store intermediate arguments that might already exist; I am not sure this is the best way to do it. Could the results be saved as variables and passed explicitly later, if the arguments are not already in kwargs?

@ArneDefauw (Contributor Author) left a comment

If depth were passed in kwargs, it would appear as an explicit argument, not inside kwargs, so in that respect I do not see a problem.
Instead of putting depth and chunks in kwargs, we could indeed also do the following:

map_func(func, arr, chunks=chunks, depth=depth, **fn_kwargs, **kwargs, dtype=arr.dtype)

We could also remove chunks and depth from the list of parameters altogether, and simply let users pass them via kwargs. In that scenario, I would also remove these sanity checks on depth and delegate them to da.map_overlap.

Re coerce_depth. We have three options:

  1. Remove coerce_depth from our code and let users pass whatever map_overlap can work with, i.e. delegate the conversion to a dict to map_overlap (which itself calls coerce_depth).
  2. Remove coerce_depth and only accept e.g. tuples, with some sanity checks. But then I think we are a bit too restrictive, especially for users who are used to working with map_overlap and prefer to pass e.g. a dict.
  3. Keep the code as is. Having coerce_depth in our code makes it more explicit what exactly is passed to map_overlap, which could help debugging later on.

I have a preference for 3, but no strong opinion on it.
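For context, dask's coerce_depth normalizes the accepted depth formats to a dict keyed by axis (a small illustration):

from dask.array.overlap import coerce_depth

coerce_depth(3, 10)         # {0: 10, 1: 10, 2: 10}
coerce_depth(3, (0, 5, 5))  # {0: 0, 1: 5, 2: 5}
coerce_depth(3, {1: 5})     # {0: 0, 1: 5, 2: 5}; missing axes default to 0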

c_coords = None
if transformations is None:
    d = get_transformation(data, get_all=True)
    assert isinstance(d, dict)
Member left a comment

when would it not be a dict? I thought it should always be?

Member left a comment

When get_all=False (default), d is a BaseTransformation. The assert is to avoid mypy errors.
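A short illustration of the two return types (a sketch; element stands for any raster element):

from spatialdata.transformations import get_transformation

t = get_transformation(element)                # a single BaseTransformation
d = get_transformation(element, get_all=True)  # dict: coordinate system -> BaseTransformation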

@melonora mentioned this pull request Jun 19, 2024
ArneDefauw and others added 3 commits June 20, 2024 08:51
Co-authored-by: Giovanni Palla <25887487+giovp@users.noreply.github.com>
Co-authored-by: Giovanni Palla <25887487+giovp@users.noreply.github.com>
@ArneDefauw
Contributor Author

looks really great @ArneDefauw, some general questions: this seems a much more slimmed-down version of #407; in particular, it seems that no calls to dask-image are ever actually made, and that it just targets the use case of applying user-provided functions to raster data.

* Does this mean that there will be a separate API/method for calls to dask.image?

* For things like `region_props`, where both a label and an image layer are needed, would there also be a separate API/method, or will it be added to this function in the future?

Considering the answers to these questions, I wonder if the function should be renamed?

These are all genuine questions; I think this is definitely ready to get in as is, and it's great. I just wanted to raise once more the roadmap/use cases of similar, yet not identical, functionalities and whether they should/will be added to spatialdata.

The functionality of this PR is quite similar to #407. The only difference is the removal of support for processing different channels/z-stacks with different Callables/fn_kwargs, as it introduces complexity that is not really necessary; I think the user can simply write a for loop if different Callables/fn_kwargs need to be used, and then concat the DataArrays as required.

Re the API/method for calls to dask-image: I would be happy to implement this if it is considered added value. With the map_raster function one can now do the following, e.g.:

def _filter(arr, sigma=5):
    from dask_image.ndfilters import gaussian_filter

    arr = arr.squeeze(0)  # drop the singleton channel axis: (1, y, x) -> (y, x)
    return gaussian_filter(arr, sigma)[None, ...]  # add it back for the output


fn_kwargs = {"sigma": 5}
se = map_raster(
    sdata_blobs["blobs_image"].sel(c=[0]),
    func=_filter,
    c_coords=None,
    fn_kwargs=fn_kwargs,
    blockwise=False,
)

Re region_props: to me this seems to fall a bit outside the scope of this function. I think we could potentially extend map_raster with support for:
image -> labels (segmentation)
multiple labels -> labels (e.g. merging of masks)
multiple images -> images

Extracting region_props seems to fall under aggregate in my opinion, where the result is a table.

Anyway, it would be good to have a discussion about the roadmap with respect to map and aggregate.

@ArneDefauw requested a review from giovp June 20, 2024 11:30
@giovp requested a review from LucaMarconato June 21, 2024 10:08
@giovp (Member) left a comment

looks all good to me, thank you @ArneDefauw! I've updated the docstrings, striving to clarify, but I would ask you and @LucaMarconato to give them another read, as there may still be room for improvement. If they look good, we just need a release note and then LGTM!

@LucaMarconato
Member

Looks great! I will update the changelog and the docs and then merge! 🚀

@ArneDefauw
Contributor Author

looks good! Thanks @giovp, only one comment:


    dims
        The dimensions of the output data. If not provided, the dimensions of the input data are used. It must be
        specified if the callable changes the data dimensions, e.g. `('c', 'y', 'x') -> ('y', 'x')`.

If going from ('c', 'y', 'x') -> ('y', 'x'), we go from images to labels; I don't think we planned to support going from images to labels with map_raster?

In practice, I think the code should already work for that use case, but it needs testing; I think we would only need to change:

if model not in (Labels2DModel, Labels3DModel):

to

if 'c' not in dims:

@LucaMarconato
Member

I don't think we planned to support going from images to labels with map_raster?

I enabled this with the line model = get_raster_model_from_data_dims(dims), which gets the appropriate model for the output data based on its dimensions.

Therefore the lines if model not in (Labels2DModel, Labels3DModel) and if 'c' not in dims are equivalent.

@ArneDefauw
Contributor Author

I don't think we planned to support going from images to labels with map_raster?

I enabled this with the line model = get_raster_model_from_data_dims(dims), which gets the appropriate model for the output data based on its dimensions.

Therefore the lines if model not in (Labels2DModel, Labels3DModel) and if 'c' not in dims are equivalent.

But the model is obtained from the input data:

https://github.com/ArneDefauw/spatialdata/blob/759fd992b6881a03e890e4082ec6c83f26aa1630/src/spatialdata/_core/operations/map.py#L88

So this check is on the input data:

https://github.com/ArneDefauw/spatialdata/blob/759fd992b6881a03e890e4082ec6c83f26aa1630/src/spatialdata/_core/operations/map.py#L114

@LucaMarconato merged commit eec6ab3 into scverse:main Jun 21, 2024
7 checks passed
@LucaMarconato
Member

You are right, the check on c_coords should be on the output data. I merged already because I need to merge #594, which was opened against the current PR. Could you please make a patch fixing that behavior?

@ArneDefauw
Contributor Author

You are right, the check on c_coords should be on the output data. I merged already because I need to merge #594, which was opened against the current PR. Could you please make a patch fixing that behavior?

Yes, will do!
