Python quick start
TileDB-SOMA is available on PyPI and can be installed via pip
as indicated below. Full installation instructions can be found here.
pip install tiledbsoma
SOMA objects can be created with their respective create() methods and then need to be populated in specific ways depending on their types.
However, a SOMAExperiment can be easily created from an AnnData object or a *.h5ad file. Here, one is created from a *.h5ad file.
import tiledbsoma.io
# Create and write a SOMA Experiment; source data: https://github.com/chanzuckerberg/cellxgene/raw/main/example-dataset/pbmc3k.h5ad
pbmc3k_uri = tiledbsoma.io.from_h5ad("./pbmc3k", input_path = "pbmc3k.h5ad", measurement_name = "RNA")
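The text above also mentions creating an experiment from an in-memory AnnData object. A minimal sketch of that route, assuming the anndata package is installed alongside tiledbsoma; the helper name experiment_from_anndata and its parameter names are hypothetical and chosen only for illustration:

```python
def experiment_from_anndata(h5ad_path: str, experiment_uri: str, measurement_name: str = "RNA") -> str:
    """Load an .h5ad file into memory and write it out as a SOMA Experiment.

    Hypothetical helper for illustration. The imports are kept inside the
    function so that defining it does not require the packages to be present.
    """
    import anndata
    import tiledbsoma.io

    # Read the whole AnnData object into memory first
    adata = anndata.read_h5ad(h5ad_path)

    # Write the in-memory object to a SOMA Experiment and return its URI
    return tiledbsoma.io.from_anndata(experiment_uri, adata, measurement_name=measurement_name)
```

A call such as experiment_from_anndata("pbmc3k.h5ad", "./pbmc3k_from_anndata") would be expected to return the URI of the new experiment. This route is mainly useful when the AnnData object has already been loaded or modified in memory; for a file on disk, from_h5ad as shown above is more direct.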
SOMA objects can be opened using tiledbsoma.open().
The contents of DataFrame, SparseNDArray and DenseNDArray can be accessed with their respective read() methods. For DataFrame and SparseNDArray, the method returns an iterator useful for larger-than-memory operations.
For example, you can open the SOMAExperiment created above and then read the contents of obs, which is a SOMADataFrame.
In addition, this example shows how you can query for observations with louvain values of 'Megakaryocytes' and 'CD4 T cells', and n_genes greater than 500.
import tiledbsoma
with tiledbsoma.open(pbmc3k_uri) as pbmc3k_soma:
    pbmc3k_obs_slice = pbmc3k_soma.obs.read(
        value_filter="n_genes > 500 and louvain in ['Megakaryocytes', 'CD4 T cells']"
    )
    # Concatenate iterator to pyarrow.Table
    pbmc3k_obs_slice.concat()
The result is a pyarrow.Table containing a slice based on the specified filters.
pyarrow.Table
soma_joinid: int64
obs_id: large_string
n_genes: int64
percent_mito: float
n_counts: float
louvain: large_string
----
soma_joinid: [[0,2,8,11,12,...,2617,2621,2626,2631,2637]]
obs_id: [["AAACATACAACCAC-1","AAACATTGATCAGC-1","AAACGCTGTAGCCA-1","AAACTTGATCCAGA-1","AAAGAGACGAGATA-1",...,"TTGTAGCTAGCTCA-1","TTTAGCTGATACCG-1","TTTCACGAGGTTCA-1","TTTCCAGAGGTGAG-1","TTTGCATGCCTCAC-1"]]
n_genes: [[781,1131,533,751,866,...,933,887,721,873,724]]
percent_mito: [[0.030177759,0.008897362,0.011764706,0.010887772,0.010788382,...,0.02224871,0.022875817,0.013261297,0.0068587107,0.008064516]]
n_counts: [[2419,3147,1275,2388,2410,...,2517,2754,2036,2187,1984]]
louvain: [["CD4 T cells","CD4 T cells","CD4 T cells","CD4 T cells","CD4 T cells",...,"CD4 T cells","CD4 T cells","CD4 T cells","CD4 T cells","CD4 T cells"]]
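When the threshold or the list of labels changes between queries, the value_filter string used above can be assembled programmatically. A minimal sketch using only the standard library; build_obs_filter is a hypothetical helper, not part of tiledbsoma, and it assumes the labels contain no quote characters that would need escaping:

```python
from typing import Sequence


def build_obs_filter(min_genes: int, cell_types: Sequence[str]) -> str:
    """Build a value_filter string of the form used in the query above.

    Hypothetical helper: labels are wrapped in single quotes via repr(),
    which is only safe for labels without quote characters.
    """
    labels = ", ".join(repr(label) for label in cell_types)
    return f"n_genes > {min_genes} and louvain in [{labels}]"


obs_filter = build_obs_filter(500, ["Megakaryocytes", "CD4 T cells"])
print(obs_filter)
```

The resulting string can then be passed as value_filter=obs_filter to obs.read().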
As stated above, the read() methods of DataFrame and SparseNDArray return an iterator. The batch size can be specified in the soma.init_buffer_bytes config option; for this example it is set to 100 bytes:
context = tiledbsoma.options.SOMATileDBContext()
context = context.replace(tiledb_config = {"soma.init_buffer_bytes": 100})
with tiledbsoma.open(pbmc3k_uri, context = context) as pbmc3k_soma:
    pbmc3k_obs = pbmc3k_soma.obs.read()
    counter = 1
    for pbmc3k_obs_chunk in pbmc3k_obs:
        # Perform operations
        # pbmc3k_obs_chunk is a pyarrow.Table
        counter += 1
print(counter)
The counter starts at 1 and is incremented once per chunk, so the printed value is one more than the number of iterations performed:
441
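The same chunk-by-chunk pattern can be illustrated without a SOMA experiment. The sketch below uses a hypothetical iter_chunks generator as a stand-in for the iterator returned by read(), and counts iterations with enumerate so that the result equals the number of chunks exactly:

```python
from typing import Iterable, Iterator, List


def iter_chunks(rows: List[int], chunk_size: int) -> Iterator[List[int]]:
    """Yield consecutive slices of rows (stand-in for a read() iterator)."""
    for start in range(0, len(rows), chunk_size):
        yield rows[start:start + chunk_size]


def count_chunks(chunks: Iterable[List[int]]) -> int:
    """Consume an iterator of chunks and return how many were produced."""
    n_chunks = 0
    for n_chunks, _chunk in enumerate(chunks, start=1):
        # Perform per-chunk operations here
        pass
    return n_chunks


print(count_chunks(iter_chunks(list(range(10)), chunk_size=3)))
```

Ten rows in chunks of three give four chunks, the last holding a single row. In the same way, a smaller buffer size yields more and smaller chunks from read().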