map raster #588

Merged: 23 commits merged into scverse:main on Jun 21, 2024

Conversation

@ArneDefauw
Contributor

updated 'apply' function for raster data.

Following discussion #407, I decided to remove the 'combine_c' and 'combine_z' parameters, and with them the support for passing different Callables/fn_kwargs per channel/z-stack. I think it overcomplicates the code; the user can simply write a for loop (see the sketch below) if different Callables/fn_kwargs should be applied to different channels/z-stacks.
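A minimal sketch of such a loop, assuming a hypothetical per_channel dict mapping channel names to (func, fn_kwargs) pairs (per_channel and image are placeholders, not part of this PR):

# Hedged sketch: apply a different Callable/fn_kwargs per channel,
# then reassemble along the channel axis.
import xarray as xr

processed = []
for c, (func, kwargs) in per_channel.items():
    res = map_raster(image.sel(c=[c]), func=func, fn_kwargs=kwargs)
    processed.append(res)
result = xr.concat(processed, dim="c")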

see unit tests for examples:

https://github.com/ArneDefauw/spatialdata/blob/b3e4487cbdfe34ccf38b5a311de1eb2a4b5427c4/tests/core/operations/test_map.py#L31

The signature is now as follows:

def map_raster(
    data: SpatialImage | MultiscaleSpatialImage,
    func: Callable,
    fn_kwargs: Mapping[str, Any] = MappingProxyType({}),
    chunks: str | int | tuple[int, ...] | tuple[tuple[int, ...], ...] | None = None,
    output_chunks: tuple[tuple[int, ...], ...] | None = None,
    depth: str | int | tuple[int, ...] | dict[int, int] | None = None,
    scale_factors: ScaleFactors_t | None = None,  # if specified will return multiscale
    c_coords: int | str | Iterable[int | str] | None = None, 
    **kwargs,
) -> SpatialImage | MultiscaleSpatialImage:
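For illustration, a minimal call in the spirit of the unit tests (_multiply is a placeholder function, sdata_blobs the test fixture):

import numpy as np

def _multiply(arr: np.ndarray, factor: float = 2.0) -> np.ndarray:
    # operates on a dask block and preserves its shape
    return arr * factor

# applied chunkwise over the dask-backed image; returns a SpatialImage
se = map_raster(sdata_blobs["blobs_image"], func=_multiply, fn_kwargs={"factor": 2.0})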

I cannot see how we could remove scale_factors or c_coords from the signature without losing some functionality.

In order for map_raster to be able to alter dimensions (e.g. (c,z,y,x) -> (c,y,x)), I think we would also need to pass dims; see https://github.com/ArneDefauw/spatialdata/blob/b3e4487cbdfe34ccf38b5a311de1eb2a4b5427c4/tests/core/operations/test_map.py#L149

@ArneDefauw mentioned this pull request Jun 17, 2024

codecov bot commented Jun 17, 2024

Codecov Report

Attention: Patch coverage is 96.66667% with 2 lines in your changes missing coverage. Please review.

Project coverage is 92.00%. Comparing base (782efab) to head (5f8db56).
Report is 1 commit behind head on main.

Current head 5f8db56 differs from pull request most recent head 759fd99

Please upload reports for the commit 759fd99 to get more accurate results.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #588      +/-   ##
==========================================
+ Coverage   91.98%   92.00%   +0.01%     
==========================================
  Files          43       44       +1     
  Lines        6560     6613      +53     
==========================================
+ Hits         6034     6084      +50     
- Misses        526      529       +3     
Files Coverage Δ
src/spatialdata/_core/centroids.py 100.00% <100.00%> (ø)
src/spatialdata/_core/operations/map.py 100.00% <100.00%> (ø)
src/spatialdata/_core/operations/vectorize.py 94.44% <100.00%> (+0.06%) ⬆️
src/spatialdata/_io/io_raster.py 95.65% <100.00%> (-0.08%) ⬇️
src/spatialdata/models/_utils.py 90.90% <80.00%> (-0.89%) ⬇️

... and 1 file with indirect coverage changes

@LucaMarconato
Member

LucaMarconato commented Jun 18, 2024

Before the full review, two quick questions.

  1. I see that chunks is used only in this line: arr = da.asarray(arr).rechunk(chunks). I would remove the chunks argument and have the user rechunk the data before calling map_raster(); this way we have one argument fewer and the function is more transparent. Or, what is the argument for delegating this task to map_raster()?

So, there is one test (test_map_raster_output_chunks()) in which the chunks argument is required. I kept this argument and renamed it to input_chunks.

Update: most of the time the input_chunks argument is not required, as the data is already chunked correctly. I have therefore changed the default behavior: if input_chunks is None, the data is used as-is, without a call to rechunk. Applying the function to the full data rather than chunkwise (the old input_chunks=None behavior) can now be achieved by passing chunkwise=False (default is True).
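A small sketch of the two behaviors (img and my_func are hypothetical placeholders):

# default: func is applied chunk-by-chunk over the existing dask blocks
res_blockwise = map_raster(img, func=my_func)

# whole-array: func sees the full data at once
res_global = map_raster(img, func=my_func, chunkwise=False)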

@LucaMarconato
Member

LucaMarconato commented Jun 18, 2024

  1. We need to decide how the output type (in the sense of single-scale vs multi-scale) relates to the input type. We have some options:
    1. the output type is always the same as the input type: if the user passes a single-scale image, a single-scale image is returned; if the user passes a multiscale image, each scale is processed independently and returned as a multiscale object. This means that a user interested in processing only the top scale of a multiscale object would need to pass next(iter(my_multiscale['scale0'].data))
    2. the output type is always a single-scale image by default, computed from the largest scale available.
    3. as in 2, but with an argument scale_factors that computes a multiscale object. I would actually prefer 1 or 2. In fact, we should rather have a helper function to_multiscale() that calls the model and returns a multiscale object; constructing multiscale objects from a single scale is a common task that the user should be able to do easily beyond the use cases of this PR.

What are your thoughts on this? I would go for either 1 or 2 and, as explained above, add a helper function to avoid the need for 3.

Update: I have implemented option 2. Happy to discuss.
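As a sketch of the helper mentioned in option 3 (hypothetical, not part of this PR; it assumes the existing Image2DModel.parse() API):

from spatialdata.models import Image2DModel

def to_multiscale(image, scale_factors):
    # re-parse a single-scale image into a multiscale object via the model
    return Image2DModel.parse(image, scale_factors=scale_factors)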

@LucaMarconato
Member

Another minor comment. I will remove the calls to rechunk that were added to address the problem of data not being saved when the chunks are irregular. Let's fix this in a more general setting inside write(), as also asked by this user: #581.

@LucaMarconato
Member

LucaMarconato commented Jun 18, 2024

I have reviewed the code, fixed the pre-commits, and simplified the APIs while maintaining all the use cases of the tests. @giovp can you also review, please?

Some minor comments:

  • @ArneDefauw I added 3 TODOs to improve the docstrings by adding examples for the parameters that allow many different types, such as chunks. Edit: I actually addressed them, corrected the types, and referred to the dask docs where appropriate.
  • Some minor tests for the exceptions should be added, for instance passing a non-raster type.
  • We should also test against labels, not just images.

@LucaMarconato requested a review from giovp June 18, 2024 14:13
@LucaMarconato marked this pull request as ready for review June 18, 2024 21:59
@LucaMarconato
Member

I addressed all the points that I left open. Some tests are failing in the CI but not locally; I think they are due to the new support for DataArray/DataTree vs SpatialImage/MultiscaleSpatialImage. After that is addressed, if @giovp approves, I'd merge.

@LucaMarconato
Member

I have added the support for DataArray/DataTree. Some tests were failing because the raster models were returning a SpatialImage instead of a DataTree due to a call to to_spatial_image(). I have fixed this, and I fixed other minor things regarding the DataArray/DataTree refactoring. @melonora please be aware of them; if you need them, you may consider cherry-picking the relevant code from my next commit until this PR is merged.

@melonora
Collaborator

  1. We need to decide how the output type (in the sense of single-scale vs multi-scale) relates to the input type. We have some options:

    1. the output type is always the same as the input type: if the user passes a single-scale image, a single-scale image is returned; if the user passes a multiscale image, each scale is processed independently and returned as a multiscale object. This means that a user interested in processing only the top scale of a multiscale object would need to pass next(iter(my_multiscale['scale0'].data))
    2. the output type is always a single-scale image by default, computed from the largest scale available.
    3. as in 2, but with an argument scale_factors that computes a multiscale object. I would actually prefer 1 or 2. In fact, we should rather have a helper function to_multiscale() that calls the model and returns a multiscale object; constructing multiscale objects from a single scale is a common task that the user should be able to do easily beyond the use cases of this PR.

What are your thoughts on this? I would go for either 1 or 2 and, as explained above, add a helper function to avoid the need for 3.

Update: I have implemented option 2. Happy to discuss.

I think option 2 is indeed best, as a user can still pull out a particular scale and use apply should they want a different level. Also, with the output, the user can still create a multiscale again.

@melonora
Collaborator

I think this PR will require a notebook tutorial

@melonora (Collaborator) left a comment

I see no blockers from my side. Great work!

@ArneDefauw
Contributor Author

  1. We need to decide how the output type (in the sense of single-scale vs multi-scale) relates to the input type. We have some options:

    1. the output type is always the same as the input type: if the user passes a single-scale image, a single-scale image is returned; if the user passes a multiscale image, each scale is processed independently and returned as a multiscale object. This means that a user interested in processing only the top scale of a multiscale object would need to pass `next(iter(my_multiscale['scale0'].data))`
    2. the output type is always a single-scale image by default, computed from the largest scale available.
    3. as in 2, but with an argument `scale_factors` that computes a multiscale object. I would actually prefer 1 or 2. In fact, we should rather have a helper function `to_multiscale()` that calls the model and returns a multiscale object; constructing multiscale objects from a single scale is a common task that the user should be able to do easily beyond the use cases of this PR.

What are your thoughts on this? I would go for either 1 or 2 and, as explained above, add a helper function to avoid the need for 3.

Update: I have implemented option 2. Happy to discuss.

Yes, indeed: while I see value in being able to provide scale_factors as a parameter, I think option 2 is the cleanest solution, as converting to multiscale is a bit outside the scope of the function.

@ArneDefauw
Contributor Author

ArneDefauw commented Jun 19, 2024

Regarding this part of the code, I have three comments:

https://github.com/ArneDefauw/spatialdata/blob/71c46654b5d85742d7ee71e3ee630533f276e339/src/spatialdata/_core/operations/map.py#L90

if not chunkwise:
    arr = func(arr, **fn_kwargs)
    if output_chunks is not None:
        arr = arr.rechunk(output_chunks)
else:
    if input_chunks is not None:
        arr = arr.rechunk(input_chunks)
  1. The introduction of the chunkwise parameter is a great choice, as it is more explicit than the chunks parameter I previously used. However, I would now remove the input_chunks parameter, as the user can simply rechunk the SpatialImage or MultiscaleSpatialImage via .chunk(...), e.g.
    se = map_raster(
        sdata_blobs[img_layer].chunk(100),
        func=_multiply,
        fn_kwargs=fn_kwargs,
        chunkwise=False,
        c_coords=None,
        depth=None,
    )

If we want to keep the input_chunks parameter, we should also use it to rechunk when chunkwise is set to False; I do not see why we should not do this.

  2. If we remove input_chunks, we can rename output_chunks to chunks, so that its name matches da.map_blocks and da.map_overlap; this prevents confusion for users familiar with dask who would otherwise pass chunks via kwargs (see the sketch after this list).
  3. I would not rechunk with output_chunks if blockwise is False; I would ignore output_chunks in that case and state as much in the docstring.

I am happy to make these changes if you agree.
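Under proposals 1 and 2 above, a call could then look like this (a hedged sketch; the chunk specs are illustrative, not from the PR):

se = map_raster(
    sdata_blobs[img_layer].chunk(100),  # the user rechunks the input explicitly
    func=_multiply,
    fn_kwargs=fn_kwargs,
    chunks=((1,), (256, 256), (256, 256)),  # output chunks, same meaning as in da.map_blocks
)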

@ArneDefauw
Contributor Author

I've added a unit test for altering dimensions, (c,z,y,x) -> (z,y,x), as it was still marked as a TODO.

@LucaMarconato
Member

Thanks for the comments; I agree with all the changes. If you could make them, that would be great, thanks.

@ArneDefauw
Contributor Author

Thanks for the comments; I agree with all the changes. If you could make them, that would be great, thanks.

Ok, will do!

@ArneDefauw
Contributor Author

Thanks for the comments; I agree with all the changes. If you could make them, that would be great, thanks.

Ok, will do!

Done. I also renamed the parameter chunkwise to blockwise, as da.map_overlap and da.map_blocks are just special cases of da.blockwise.
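For reference, the dask calls the name alludes to (a minimal illustration, not code from this PR):

import dask.array as da
import numpy as np

arr = da.ones((4, 4), chunks=2)
da.map_blocks(np.square, arr)                             # per block, no halo
da.map_overlap(np.square, arr, depth=1, boundary="none")  # per block, with a halo of 1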

@giovp (Member) left a comment

looks really great @ArneDefauw, some general questions:
this seems a much more slimmed-down version of #407; in particular, it seems that no calls to dask-image are ever actually made, and that it just targets the use case of applying user-provided functions to raster data.

  • Does this mean that there will be a separate API/method for calls to dask.image?
  • For things like region_props, where both a label and an image layer are needed, would there also be a separate API/method, or will it be added to this function in the future?

Considering the answers to these questions, I wonder if the function should be renamed?

These are all genuine questions; I think this is definitely ready to get in as is, and it's great. I just wanted to raise once more the roadmap/use cases of similar, yet not identical, functionalities and whether they should/will be added to spatialdata.

f"Depth {depth} is provided for {len(depth)} dimensions. "
f"Please (only) provide depth for {arr.ndim} dimensions."
)
kwargs["depth"] = coerce_depth(arr.ndim, depth)
Member left a comment

while coerce_depth is a public function, it is not documented. I wonder if it would be better to be explicit about what to pass via depth (and raise an error, as above) instead. Also, again, kwargs is used here to store intermediate arguments that might already exist; I am not sure this is the best way to do it. Could the results be saved as variables and passed explicitly later, if the arguments are not already in kwargs?

@ArneDefauw (Contributor Author) left a comment

If depth were passed in kwargs, it would appear as an explicit argument, not inside kwargs, so in that respect I do not see a problem.
Instead of putting depth and chunks in kwargs, we could indeed also do the following:

map_func(func, arr, chunks=chunks, depth=depth, **fn_kwargs, **kwargs, dtype=arr.dtype)

We could also remove chunks and depth from the list of parameters altogether, and simply let users pass them via kwargs. In that scenario, I would also remove these sanity checks on depth and delegate them to da.map_overlap.

Re coerce_depth. We have three options:

  1. Remove coerce_depth from our code and let users pass whatever map_overlap can work with, i.e. delegate the conversion to a dict to map_overlap (which itself calls coerce_depth).
  2. Remove coerce_depth and only accept e.g. tuples, with some sanity checks. But then I think we are a bit too restrictive, especially for users who are used to working with map_overlap and prefer to pass e.g. a dict.
  3. Keep the code as is. Having coerce_depth in our code makes it more explicit what exactly is passed to map_overlap, which could help debugging later on.

I have a preference for 3, but no strong opinion on it.
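For context, dask's coerce_depth normalizes the accepted depth formats to a dict keyed by axis (a small illustration):

from dask.array.overlap import coerce_depth

coerce_depth(3, 10)         # {0: 10, 1: 10, 2: 10}
coerce_depth(3, (0, 5, 5))  # {0: 0, 1: 5, 2: 5}
coerce_depth(3, {1: 5})     # {0: 0, 1: 5, 2: 5}; missing axes default to 0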

c_coords = None
if transformations is None:
    d = get_transformation(data, get_all=True)
    assert isinstance(d, dict)
Member left a comment

when would it not be a dict? I thought it should always be?

Member left a comment

When get_all=False (default), d is a BaseTransformation. The assert is to avoid mypy errors.
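A short illustration of the two return types (a sketch; element stands for any raster element):

from spatialdata.transformations import get_transformation

t = get_transformation(element)                # a single BaseTransformation
d = get_transformation(element, get_all=True)  # dict: coordinate system -> BaseTransformation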

@melonora mentioned this pull request Jun 19, 2024
ArneDefauw and others added 3 commits June 20, 2024 08:51
Co-authored-by: Giovanni Palla <25887487+giovp@users.noreply.github.com>
Co-authored-by: Giovanni Palla <25887487+giovp@users.noreply.github.com>
@ArneDefauw
Contributor Author

looks really great @ArneDefauw, some general questions: this seems a much more slimmed-down version of #407; in particular, it seems that no calls to dask-image are ever actually made, and that it just targets the use case of applying user-provided functions to raster data.

* Does this mean that there will be a separate API/method for calls to dask.image?

* For things like `region_props`, where both a label and an image layer are needed, would there also be a separate API/method, or will it be added to this function in the future?

Considering the answers to these questions, I wonder if the function should be renamed?

These are all genuine questions; I think this is definitely ready to get in as is, and it's great. I just wanted to raise once more the roadmap/use cases of similar, yet not identical, functionalities and whether they should/will be added to spatialdata.

The functionality of this PR is quite similar to #407. The only difference is the removal of support for processing different channels/z-stacks with different Callables/fn_kwargs, as it introduces complexity that is not really necessary; I think the user can simply write a for loop if different Callables/fn_kwargs need to be used, and then concat the DataArrays as required.

Re the API/method for calls to dask-image: I would be happy to implement this if it is considered added value. With the map_raster function one can now do the following, e.g.:

def _filter(arr, sigma=5):
    from dask_image.ndfilters import gaussian_filter

    arr = arr.squeeze(0)  # drop the singleton channel axis: (1, y, x) -> (y, x)
    return gaussian_filter(arr, sigma)[None, ...]  # add it back for the output


fn_kwargs = {"sigma": 5}
se = map_raster(
    sdata_blobs["blobs_image"].sel(c=[0]),
    func=_filter,
    c_coords=None,
    fn_kwargs=fn_kwargs,
    blockwise=False,
)

Re region_props: to me this seems to fall a bit outside the scope of this function. I think we could potentially extend map_raster with support for:
image -> labels (segmentation)
multiple labels -> labels (e.g. merging of masks)
multiple images -> images

Extracting region_props seems to fall under aggregate in my opinion, where the result is a table.

Anyway, it would be good to have a discussion about the roadmap with respect to map and aggregate.

@ArneDefauw requested a review from giovp June 20, 2024 11:30
@giovp requested a review from LucaMarconato June 21, 2024 10:08
@giovp (Member) left a comment

looks all good to me, thank you @ArneDefauw! I've updated the docstrings, striving to clarify, but I would ask you and @LucaMarconato to give them another read, as there may still be room for improvement. If they look good, we just need a release note and then LGTM!

@LucaMarconato
Member

Looks great! I will update the changelog and the docs and then merge! 🚀

@ArneDefauw
Contributor Author

looks good! Thanks @giovp, only one comment:


    dims
        The dimensions of the output data. If not provided, the dimensions of the input data are used. It must be
        specified if the callable changes the data dimensions, e.g. `('c', 'y', 'x') -> ('y', 'x')`.

If going from ('c', 'y', 'x') -> ('y', 'x'), we go from images to labels; I don't think we planned to support going from images to labels with map_raster?

In practice, I think the code should already work for that use case, but it needs testing; I think we would only need to change:

if model not in (Labels2DModel, Labels3DModel):

to

if 'c' not in dims:

@LucaMarconato
Member

I don't think we planned to support going from images to labels with map_raster?

I enabled this with the line model = get_raster_model_from_data_dims(dims), which gets the appropriate model for the output data based on its dimensions.

Therefore the lines if model not in (Labels2DModel, Labels3DModel) and if 'c' not in dims are equivalent.

@ArneDefauw
Contributor Author

I don't think we planned to support going from images to labels with map_raster?

I enabled this with the line model = get_raster_model_from_data_dims(dims), which gets the appropriate model for the output data based on its dimensions.

Therefore the lines if model not in (Labels2DModel, Labels3DModel) and if 'c' not in dims are equivalent.

But the model is obtained from the input data:

https://github.com/ArneDefauw/spatialdata/blob/759fd992b6881a03e890e4082ec6c83f26aa1630/src/spatialdata/_core/operations/map.py#L88

So this check is on the input data:

https://github.com/ArneDefauw/spatialdata/blob/759fd992b6881a03e890e4082ec6c83f26aa1630/src/spatialdata/_core/operations/map.py#L114

@LucaMarconato merged commit eec6ab3 into scverse:main Jun 21, 2024
7 checks passed
@LucaMarconato
Member

You are right, the check on c_coords should be on the output data. I merged already because I need to merge #594, which was opened against the current PR. Could you please make a patch fixing that behavior?

@ArneDefauw
Contributor Author

You are right, the check on c_coords should be on the output data. I merged already because I need to merge #594, which was opened against the current PR. Could you please make a patch fixing that behavior?

Yes, will do!
