Added filter_table_by_query #894

Draft
wants to merge 2 commits into base: main


srivarra

@srivarra srivarra commented Mar 3, 2025

Added filtering by a table query as discussed in #626. Added both a standalone function, sd.filter_by_table_query, and a method, sd.SpatialData.filter_by_table_query.

Function signature

class SpatialData:
    ...

    def filter_by_table_query(
        self,
        table_name: str,
        filter_tables: bool = True,
        elements: list[str] | None = None,
        obs_expr: Predicates | None = None,
        var_expr: Predicates | None = None,
        x_expr: Predicates | None = None,
        obs_names_expr: Predicates | None = None,
        var_names_expr: Predicates | None = None,
        layer: str | None = None,
        how: Literal["left", "left_exclusive", "inner", "right", "right_exclusive"] = "right",
    ) -> SpatialData:
        ...

sd.filter_by_table_query takes the same arguments, but instead of self, you pass the SpatialData object of interest as the first argument.


What expressions can you use?

  • Several narwhals expression methods are supported, as long as the method doesn't aggregate.
    • I know that the following work: >, >=, <, <=, ==, is_in.
    • From Expr.str, contains, starts_with, and ends_with work.
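
To see why aggregating methods are excluded, it can help to think of a filter as a per-row boolean mask. The following is a minimal pure-Python sketch of that idea, not narwhals or annsel code; `keep_rows`, `donors`, and `cell_sizes` are hypothetical names used only for illustration.

```python
# Illustrative only: plain Python lists stand in for table columns.
donors = ["21d7", "90de", "21d7", "30ab"]
cell_sizes = [350, 420, 510, 610]


def keep_rows(rows, mask):
    """Keep the rows whose mask entry is True; the mask needs one entry per row."""
    if len(mask) != len(rows):
        raise ValueError("a filter mask needs exactly one boolean per row")
    return [row for row, keep in zip(rows, mask) if keep]


# Element-wise predicates (like ==, >, is_in) give one boolean per row.
mask = [d == "21d7" and s > 400 for d, s in zip(donors, cell_sizes)]
filtered = keep_rows(list(zip(donors, cell_sizes)), mask)

# An aggregation (like a mean) collapses the column to a single value,
# so on its own it cannot serve as a row mask.
mean_size = sum(cell_sizes) / len(cell_sizes)
```

Here `filtered` keeps only the rows where both element-wise conditions hold, while `mean_size` is a single number with no per-row shape.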

What parts can you filter on?

You can filter on the obs and var DataFrame attributes of AnnData.

You can filter on obs_names and var_names (using an.obs_names and an.var_names instead of an.col).

You can filter on the expression matrix X, optionally with respect to a layer (via the layer argument).
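
As a rough mental model of which axis each kind of expression acts on, here is a small pure-Python sketch. It does not use AnnData or annsel; the toy table, the variable names, and the assumption that row-level filters are intersected are all illustrative.

```python
# A tiny stand-in for an AnnData-like table: 3 observations x 2 variables.
obs_names = ["cell0", "cell1", "cell2"]
var_names = ["CD8", "ASCT2"]
obs = {"donor": ["21d7", "90de", "21d7"]}
X = [
    [0.0, 0.5],
    [1.2, 0.05],
    [0.3, 1.5],
]

# obs-style predicate: one boolean per observation (row).
row_mask_obs = [donor == "21d7" for donor in obs["donor"]]

# X-style predicate: also one boolean per row, computed from the values
# of a single variable (here ASCT2).
asct2 = var_names.index("ASCT2")
row_mask_x = [row[asct2] > 0.1 for row in X]

# var_names-style predicate: one boolean per variable (column).
col_mask = [name.startswith("CD") for name in var_names]

# Assumption for this sketch: the row-level masks are combined with AND.
rows = [i for i, (a, b) in enumerate(zip(row_mask_obs, row_mask_x)) if a and b]
cols = [k for k, keep in enumerate(col_mask) if keep]

kept_obs = [obs_names[i] for i in rows]
kept_vars = [var_names[k] for k in cols]
sub_X = [[X[i][k] for k in cols] for i in rows]
```

In this toy case the obs and X predicates narrow the rows, and the var_names predicate narrows the columns.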


Some Examples

# Using the mibitof dataset because it's small and has a table that covers multiple SpatialData elements.

import spatialdata as sd
import annsel as an
from upath import UPath

mibitof_path = UPath("~/Downloads/mibitof-dataset.zarr")

sdata = sd.read_zarr(mibitof_path)

sdata
SpatialData Repr
SpatialData object, with associated Zarr store: /Users/srivarra/Downloads/mibitof-dataset.zarr
├── Images
│     ├── 'point8_image': DataArray[cyx] (3, 1024, 1024)
│     ├── 'point16_image': DataArray[cyx] (3, 1024, 1024)
│     └── 'point23_image': DataArray[cyx] (3, 1024, 1024)
├── Labels
│     ├── 'point8_labels': DataArray[yx] (1024, 1024)
│     ├── 'point16_labels': DataArray[yx] (1024, 1024)
│     └── 'point23_labels': DataArray[yx] (1024, 1024)
└── Tables
      └── 'table': AnnData (3309, 36)
with coordinate systems:
    ▸ 'point8', with elements:
        point8_image (Images), point8_labels (Labels)
    ▸ 'point16', with elements:
        point16_image (Images), point16_labels (Labels)
    ▸ 'point23', with elements:
        point23_image (Images), point23_labels (Labels)

For context, here is what the table looks like:

AnnData object with n_obs × n_vars = 3309 × 36
    obs: 'row_num', 'point', 'cell_id', 'X1', 'center_rowcoord', 'center_colcoord', 'cell_size', 'category', 'donor', 'Cluster', 'batch', 'library_id'
    uns: 'spatialdata_attrs'
    obsm: 'X_scanorama', 'X_umap', 'spatial'

  1. Filter observations to the donor "21d7", and keep the var_names "ASCT2" and "ATP5A" plus any marker that starts with "CD".
sd.filter_by_table_query(
    sdata,
    table_name="table",
    obs_expr=an.col("donor") == "21d7",
    var_names_expr=(
        an.var_names.is_in(["ASCT2", "ATP5A"])
        | an.var_names.str.starts_with("CD")
    ),
    x_expr=None,
)
Output

SpatialData object
├── Labels
│     └── 'point23_labels': DataArray[yx] (1024, 1024)
└── Tables
      └── 'table': AnnData (1241, 14)
with coordinate systems:
    ▸ 'point23', with elements:
        point23_labels (Labels)

  2. Filter by batches "0" and "1".
sdata.filter_by_table_query(
    table_name="table",
    obs_expr=an.col("batch").is_in(["1", "0"]),
)
Output

SpatialData object
├── Labels
│     ├── 'point8_labels': DataArray[yx] (1024, 1024)
│     └── 'point23_labels': DataArray[yx] (1024, 1024)
└── Tables
      └── 'table': AnnData (2286, 36)
with coordinate systems:
    ▸ 'point8', with elements:
        point8_labels (Labels)
    ▸ 'point23', with elements:
        point23_labels (Labels)

  3. Filter by obs_names that start with "9".
sd.filter_by_table_query(
    sdata,
    table_name="table",
    obs_names_expr=an.obs_names.str.starts_with("9")
)
Output

SpatialData object
├── Labels
│     └── 'point8_labels': DataArray[yx] (1024, 1024)
└── Tables
      └── 'table': AnnData (624, 36)
with coordinate systems:
    ▸ 'point8', with elements:
        point8_labels (Labels)

  4. Note that a tuple of expressions is combined with the & operator.
sd.filter_by_table_query(
    sdata,
    table_name="table",
    var_names_expr=(an.var_names.str.contains("CD"), an.var_names == "CD8"),
    x_expr=None,
)
Output

SpatialData object
├── Labels
│     ├── 'point8_labels': DataArray[yx] (1024, 1024)
│     ├── 'point16_labels': DataArray[yx] (1024, 1024)
│     └── 'point23_labels': DataArray[yx] (1024, 1024)
└── Tables
      └── 'table': AnnData (3309, 1)
with coordinate systems:
    ▸ 'point8', with elements:
        point8_labels (Labels)
    ▸ 'point16', with elements:
        point16_labels (Labels)
    ▸ 'point23', with elements:
        point23_labels (Labels)
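
The tuple behaviour above can be pictured as AND-ing the individual masks position by position. This is a minimal pure-Python sketch of that reading, not the annsel implementation; `combine_all` and the short `var_names` list are hypothetical.

```python
from functools import reduce
from operator import and_

# A short, made-up list of variable names for illustration.
var_names = ["CD3", "CD8", "CD45", "ASCT2"]

# Two element-wise predicates, each yielding one boolean per variable name.
contains_cd = ["CD" in name for name in var_names]
equals_cd8 = [name == "CD8" for name in var_names]


def combine_all(*masks):
    """AND several boolean masks together, position by position."""
    return [reduce(and_, values) for values in zip(*masks)]


combined = combine_all(contains_cd, equals_cd8)
kept = [name for name, keep in zip(var_names, combined) if keep]
```

Only names that satisfy every predicate in the tuple survive, which is consistent with the single remaining variable in the output above.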

  5. You can make it as complicated as you want.
sd.filter_by_table_query(
    sdata,
    table_name="table",
    # Filter observations (rows) based on multiple conditions
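    obs_expr=(
        # Cells from donor 21d7 OR 90de
        an.col("donor").is_in(["21d7", "90de"])
        # AND cells with size greater than 400
        & (an.col("cell_size") > 400)
        # AND cells in the Epithelial cluster
        & (an.col("Cluster") == "Epithelial")
        # OR any cell whose cluster name contains "Tcell". Python evaluates &
        # before |, so this branch is not restricted by the donor and size
        # conditions above; parenthesize the two Cluster conditions together
        # if it should be.
        | (an.col("Cluster").str.contains("Tcell"))
    ),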
    # Filter variables (columns) based on multiple conditions
    var_names_expr=(
        # Select columns that start with CD
        an.var_names.str.starts_with("CD")
        # OR columns that contain "ATP"
        | an.var_names.str.contains("ATP")
        # OR specific columns
        | an.var_names.is_in(["ASCT2", "PKM2", "SMA"])
    ),
    # Filter based on expression values
    x_expr=(
        # Keep cells where ASCT2 is greater than 0.1
        (an.col("ASCT2") > 0.1)
        # AND less than 2 for ASCT2
        & (an.col("ASCT2") < 2)
    ),
    # Additional parameters
    how="right",
    elements=["point23_labels", "point8_labels"],
)
Output

SpatialData object
├── Labels
│     ├── 'point8_labels': DataArray[yx] (1024, 1024)
│     └── 'point23_labels': DataArray[yx] (1024, 1024)
└── Tables
      └── 'table': AnnData (268, 17)
with coordinate systems:
    ▸ 'point8', with elements:
        point8_labels (Labels)
    ▸ 'point23', with elements:
        point23_labels (Labels)
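
When chaining conditions as in obs_expr above, keep in mind that Python evaluates & before |. The sketch below uses plain booleans for a single hypothetical row (the variable names are made up) to show how grouping changes the result.

```python
# Booleans stand in for one row's already-evaluated conditions.
donor_ok = False       # donor is neither 21d7 nor 90de
size_ok = False        # cell_size is not greater than 400
is_epithelial = False  # Cluster is not "Epithelial"
is_tcell = True        # Cluster name contains "Tcell"

# Without extra parentheses, & is evaluated before |, so this reads as
# (donor_ok & size_ok & is_epithelial) | is_tcell.
ungrouped = donor_ok & size_ok & is_epithelial | is_tcell

# With the two Cluster conditions grouped explicitly.
grouped = donor_ok & size_ok & (is_epithelial | is_tcell)
```

For this row, `ungrouped` is True and `grouped` is False, so the two spellings select different cells.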


codecov bot commented Mar 3, 2025

Codecov Report

Attention: Patch coverage is 44.44444% with 5 lines in your changes missing coverage. Please review.

Project coverage is 92.05%. Comparing base (62356a2) to head (3140fc0).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/spatialdata/_core/query/relational_query.py | 40.00% | 3 Missing ⚠️ |
| src/spatialdata/_core/spatialdata.py | 50.00% | 2 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #894      +/-   ##
==========================================
- Coverage   92.11%   92.05%   -0.06%     
==========================================
  Files          48       48              
  Lines        7429     7438       +9     
==========================================
+ Hits         6843     6847       +4     
- Misses        586      591       +5     

| Files with missing lines | Coverage Δ |
|---|---|
| src/spatialdata/__init__.py | 96.42% <ø> (ø) |
| src/spatialdata/_core/spatialdata.py | 91.29% <50.00%> (-0.17%) ⬇️ |
| src/spatialdata/_core/query/relational_query.py | 90.59% <40.00%> (-0.55%) ⬇️ |

@srivarra srivarra marked this pull request as draft March 6, 2025 05:01

1 participant