perf: add a setup_anndata method (#54)

* perf: add a setup_anndata method * fix(docs): downgrade protobuf to allow success when building docs * chore(setup): refactor codes
Novartis · Jun 7, 2022 · 923b1e5 · 923b1e5
1 parent e81cfc3
commit 923b1e5
Show file tree

Hide file tree

Showing 15 changed files with 2,274 additions and 1,165 deletions.
diff --git a/docs/Introduction.rst b/docs/Introduction.rst
@@ -17,7 +17,7 @@ The design of scAR
    :width: 600
    :align: center
 
-scAR uses a latent variable model to represent the biological and technical components in the observed count data. scAR is built under ambient signal hypothesis, in which the probability of occurrence of each ambient transcript can be empirically estimated from cell-free droplets. There are two hidden variables, contamination level per cell and the probability of occurrence of native transcript. With these three parameters, we are able to reconstruct the noisy observations. To learn the hidden variables, we train neural networks (the variational autoencoder) to minimize the differences between the reconstructions and original noisy observations. Once converging, the contamination levels and native expression are inferred and downstream analysis can be performed using these values.
+scAR uses a latent variable model to represent the biological and technical components in the observed count data. scAR is built under ambient signal hypothesis, in which the probability of occurrence of each ambient transcript can be empirically estimated from cell-free droplets. There are two hidden variables, contamination level per cell and the probability of occurrence of native transcript. With these three parameters, we are able to reconstruct the noisy observations. To learn the hidden variables, we train neural networks (the variational autoencoder) to minimize the differences between the reconstructions and original noisy observations. Once converging, the contamination levels and native expression are inferred and downstream analysis can be performed using these values. See our manuscript [Sheng2022]_ for details.
 
 What types of data that scAR can process?
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

diff --git a/docs/Reference.rst b/docs/Reference.rst
@@ -1,6 +1,14 @@
 Reference
 ===============
 
-.. [Sheng2022] C. Sheng, R. Lopes, G. Li, S. Schuierer, A. Waldt, R. Cuttat, S. Dimitrieva, A. Kauffmann, E. Durand, G. G. Galli, G. Roma, A. De, W., 
-   **Probabilistic machine learning ensures accurate ambient denoising in droplet-based single-cell omics**. 2022.
-   `bioRxiv <https://www.biorxiv.org/content/early/2022/03/24/2022.01.14.476312>`__.
+.. [Lun2019] Lun *et al.* (2019),
+   `EmptyDrops: Distinguishing cells from empty droplets in droplet-based single-cell RNA sequencing data <https://doi.org/10.1186/s13059-019-1662-y>`__,
+   Genome Biology.
+
+.. [Dixit2016] Dixit *et al.* (2016),
+   `Perturb-Seq: Dissecting Molecular Circuits with Scalable Single-Cell RNA Profiling of Pooled Genetic Screens <http://dx.doi.org/10.1016/j.cell.2016.11.038>`__,
+   Cell.
+
+.. [Sheng2022] Sheng *et al.* (2022),
+   `Probabilistic machine learning ensures accurate ambient denoising in droplet-based single-cell omics <https://www.biorxiv.org/content/early/2022/03/24/2022.01.14.476312>`__,
+   bioRxiv.
diff --git a/docs/requirements.txt b/docs/requirements.txt
@@ -7,4 +7,5 @@ sphinx-argparse
 sphinx-disqus
 autodocsumm
 ipykernel
+protobuf==3.20.*
 git+https://github.com/Novartis/scAR.git
diff --git a/docs/tutorials/Tutorials.ipynb b/docs/tutorials/Tutorials.ipynb
@@ -28,9 +28,10 @@
     ]
    },
    "source": [
+    "* [Ambient profile](scAR_tutorial_ambient_profile.ipynb)\n",
     "* [sgRNA assignment](scAR_tutorial_sgRNA_assignment.ipynb)\n",
-    "* [Assignment of identity-barcodes](scAR_tutorial_identity_barcode.ipynb)\n",
-    "* [Denoising ADT for CITE-seq](scAR_tutorial_denoising_CITEseq.ipynb)\n",
+    "* [CMO Assignment](scAR_tutorial_identity_barcode.ipynb)\n",
+    "* [Denoising ADT](scAR_tutorial_denoising_CITEseq.ipynb)\n",
     "* [Denoising mRNA](scAR_tutorial_mRNA_denoising.ipynb)"
    ]
   },

diff --git a/docs/tutorials/scAR_tutorial_ambient_profile.ipynb b/docs/tutorials/scAR_tutorial_ambient_profile.ipynb
diff --git a/docs/tutorials/scAR_tutorial_denoising_CITEseq.ipynb b/docs/tutorials/scAR_tutorial_denoising_CITEseq.ipynb
diff --git a/docs/tutorials/scAR_tutorial_identity_barcode.ipynb b/docs/tutorials/scAR_tutorial_identity_barcode.ipynb
@@ -122,12 +122,26 @@
     "## Estimate ambient profile"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "9038aa21-175f-4a07-8ba3-0d8986c2a6fd",
+   "metadata": {},
+   "source": [
+    "Identify cell-containing and cell-free droplets using kneeplot of mRNA counts.  "
+   ]
+  },
   {
    "cell_type": "markdown",
    "id": "5365f55d-a8b4-4f85-934d-e2944110a7d1",
    "metadata": {},
    "source": [
-    "Identify cell-containing and cell-free droplets using kneeplot of mRNA counts."
+    "<div class=\"alert alert-info\">\n",
+    "\n",
+    "Note\n",
+    "\n",
+    "An alternative is to calculate ambient profile with `setup_anndata` method. See an example in [calculating ambient profile](https://scar-tutorials.readthedocs.io/en/latest/scAR_tutorial_ambient_profile.html).\n",
+    "\n",
+    "</div>"
    ]
   },
   {
@@ -152,13 +166,7 @@
    "id": "73b48649-ea80-484a-83b1-1abe3fe3deee",
    "metadata": {},
    "source": [
-    "<div class=\"alert alert-info\">\n",
-    "\n",
-    "Note\n",
-    "\n",
-    "The thresholds (200 and 500) are experiment-specific. We currently manually determine them by examing the following kneeplot. \n",
-    "\n",
-    "</div>"
+    "The thresholds (200 and 500) are experiment-specific. We here manually determine them by examing the following kneeplot."
    ]
   },
   {

diff --git a/docs/tutorials/scAR_tutorial_mRNA_denoising.ipynb b/docs/tutorials/scAR_tutorial_mRNA_denoising.ipynb
diff --git a/docs/tutorials/scAR_tutorial_sgRNA_assignment.ipynb b/docs/tutorials/scAR_tutorial_sgRNA_assignment.ipynb
diff --git a/docs/usages/index.rst b/docs/usages/index.rst
@@ -3,7 +3,19 @@ Python API
 
 Processing
 ----------------------
-The core module of scar.
+Calculate ambient profile
+
+.. currentmodule:: scar.main._setup
+
+.. autosummary::
+   :nosignatures:
+
+   setup_anndata
+
+
+Training
+----------------------
+The core module of scar
 
 .. currentmodule:: scar.main._scar
 

diff --git a/docs/usages/processing.rst b/docs/usages/processing.rst
@@ -1,7 +1,7 @@
-Denoising model
-===============================
-.. automodule:: scar.main._scar
+Setup anndata
+==============
+Calculate ambient profile for relevant feature types
 
-.. autoclass:: model
-    :members:
-    :member-order: bysource
+.. automodule:: scar.main._setup
+
+.. autofunction:: setup_anndata
diff --git a/docs/usages/training.rst b/docs/usages/training.rst
@@ -0,0 +1,8 @@
+Denoising model
+===============================
+
+.. automodule:: scar.main._scar
+
+.. autoclass:: model
+    :members:
+    :member-order: bysource
diff --git a/scar/__init__.py b/scar/__init__.py
@@ -2,4 +2,5 @@
 
 from .main.__version__ import __version__
 from .main._scar import model
+from .main._setup import setup_anndata
 from .main import _data_generater as data_generator
diff --git a/scar/main/_scar.py b/scar/main/_scar.py
@@ -75,7 +75,7 @@ class model:
                     the sparsity should be low; on the other hand, it should be set high \
                         in the case of unflitered genes. \
                         Forced to be one in the mode of "sgRNA(s)" and "tag(s)". \
-                            Thank Will Macnair very much for the valuable feedback.
+                            Thank Will Macnair for the valuable feedback.
 
         Raises
         ------
@@ -553,11 +553,11 @@ def inference(
             cutoff for Bayesfactors, by default 3
         round_to_int : str, optional
             whether to round the counts, by default "stochastic_rounding"
-        moi : int, optional (under development) \
+        moi : int, optional (under development)
             multiplicity of infection. If assigned, it will allow optimized thresholding, \
                 which tests a series of cutoffs to find the best one \
                     based on distributions of infections under given moi.\
-                        See http://dx.doi.org/10.1016/j.cell.2016.11.038, by default None
+                        See Perturb-seq [Dixit2016]_ for details, by default None
         Returns
         -------
             After inferring, several attributes will be added, inc. native_counts, bayesfactor,\
@@ -629,7 +629,7 @@ def assignment(self, cutoff=3, moi=None):
             If assigned, it will allow optimized thresholding,\
                 which tests a series of cutoffs to find the best one \
                     based on distributions of infections under given moi.\
-                        See http://dx.doi.org/10.1016/j.cell.2016.11.038, by default None
+                        See  Perturb-seq [Dixit2016]_, by default None
         Returns
         -------
             After running, a attribute 'feature_assignment' will be added,\

diff --git a/scar/main/_setup.py b/scar/main/_setup.py
@@ -0,0 +1,201 @@
+from typing import Union
+import numpy as np
+import pandas as pd
+import matplotlib.pyplot as plt
+import seaborn as sns
+from anndata import AnnData
+import torch
+from torch.distributions.multinomial import Multinomial
+
+
+def setup_anndata(
+    adata: AnnData,
+    raw_adata: AnnData,
+    feature_type: Union[str, list] = None,
+    prob: float = 0.995,
+    min_raw_counts: int = 2,
+    iterations: int = 3,
+    n_batch: int = 1,
+    kneeplot: bool = True,
+    verbose: bool = True,
+    figsize: tuple = (6, 6),
+):
+    """Calculate ambient profile for relevant features
+
+    Identify the cell-free droplets through a multinomial distribution. See EmptyDrops [Lun2019]_ for details.
+
+
+    Parameters
+    ----------
+    adata : AnnData
+        A filtered adata object, loaded from filtered_feature_bc_matrix using `scanpy.read` , gene filtering is recommended to save memory
+    raw_adata : AnnData
+        An raw adata object, loaded from raw_feature_bc_matrix using `scanpy.read`
+    feature_type : Union[str, list], optional
+        Feature type, e.g. 'Gene Expression', 'Antibody Capture', 'CRISPR Guide Capture' or 'Multiplexing Capture', all feature types are calculated if None, by default None
+    prob : float, optional
+        The probability of each gene, considered as containing ambient RNA if greater than prob (joint prob euqals to the product of all genes for a droplet), by default 0.995
+    min_raw_counts : int, optional
+        Total counts filter for raw_adata, filtering out low counts to save memory, by default 2
+    iterations : int, optional
+        Total iterations, by default 3
+    n_batch : int, optional
+        Total number of batches, set it to a bigger number when out of memory issue occurs, by default 1
+    kneeplot : bool, optional
+        Kneeplot to show subpopulations of droplets, by default True
+    verbose : bool, optional
+        Whether to display message
+    figsize : tuple, optimal
+        Figure size, by default (6, 6)
+
+    Returns
+    -------
+    The relevant ambient profile is added in adata.uns
+
+    Examples
+    ---------
+    .. plot::
+        :context: close-figs
+
+        import scanpy as sc
+        from scar import setup_anndata
+        # read filtered data
+        adata = sc.read_10x_h5(filename='500_hgmm_3p_LT_Chromium_Controller_filtered_feature_bc_matrix.h5ad',
+                             backup_url='https://cf.10xgenomics.com/samples/cell-exp/6.1.0/500_hgmm_3p_LT_Chromium_Controller/500_hgmm_3p_LT_Chromium_Controller_filtered_feature_bc_matrix.h5');
+        adata.var_names_make_unique();
+        # read raw data
+        adata_raw = sc.read_10x_h5(filename='500_hgmm_3p_LT_Chromium_Controller_raw_feature_bc_matrix.h5ad',
+                             backup_url='https://cf.10xgenomics.com/samples/cell-exp/6.1.0/500_hgmm_3p_LT_Chromium_Controller/500_hgmm_3p_LT_Chromium_Controller_raw_feature_bc_matrix.h5');
+        adata_raw.var_names_make_unique();
+        # gene and cell filter
+        sc.pp.filter_genes(adata, min_counts=200);
+        sc.pp.filter_genes(adata, max_counts=6000);
+        sc.pp.filter_cells(adata, min_genes=200);
+        # setup anndata
+        setup_anndata(
+            adata,
+            adata_raw,
+            feature_type = "Gene Expression",
+            prob = 0.975,
+            min_raw_counts = 2,
+            kneeplot = True,
+        )
+    """
+
+    if feature_type is None:
+        feature_type = adata.var["feature_types"].unique()
+    elif isinstance(feature_type, str):
+        feature_type = [feature_type]
+
+    # take subset genes to save memory
+    raw_adata._inplace_subset_var(raw_adata.var_names.isin(adata.var_names))
+    raw_adata._inplace_subset_obs(raw_adata.X.sum(axis=1) >= min_raw_counts)
+
+    raw_adata.obs["total_counts"] = raw_adata.X.sum(axis=1)
+    raw_count = raw_adata.X.astype(int).A
+
+    # initial estimation of ambient profile, will be update
+    ambient_prof = raw_adata.X.sum(axis=0) / raw_adata.X.sum()
+
+    if verbose:
+        print("Estimating ambient profile for ", feature_type, "...")
+    i = 0
+    while i < iterations:
+
+        # calculate joint probability (log) of being cell-free droplets for each droplet
+
+        log_prob = []
+        batches = np.array_split(raw_count, n_batch)
+        for b in range(n_batch):
+            count_batch = batches[b]
+            log_prob_batch = Multinomial(
+                probs=torch.tensor(ambient_prof), validate_args=False
+            ).log_prob(torch.Tensor(count_batch))
+            log_prob.append(log_prob_batch)
+
+        log_prob = np.concatenate(log_prob, axis=0)
+        raw_adata.obs["log_prob"] = log_prob
+        raw_adata.obs["droplets"] = "other droplets"
+
+        # cell-containing droplets
+        raw_adata.obs.loc[adata.obs_names, "droplets"] = "cells"
+
+        # identify cell-free droplets
+        raw_adata.obs["droplets"] = raw_adata.obs["droplets"].mask(
+            raw_adata.obs["log_prob"] >= np.log(prob) * raw_adata.shape[1],
+            "cell-free droplets",
+        )
+        emptydrops = raw_adata[raw_adata.obs["droplets"] == "cell-free droplets"]
+
+        ambient_prof = emptydrops.X.sum(axis=0) / emptydrops.X.sum()
+
+        i += 1
+
+        if verbose:
+            print("iteration: ", i)
+
+    # update ambient profile
+    for ft in feature_type:
+        tmp = emptydrops[:, emptydrops.var["feature_types"] == ft]
+        adata.uns[f"ambient_profile_{ft}"] = pd.DataFrame(
+            tmp.X.sum(axis=0).reshape(-1, 1) / tmp.X.sum(),
+            index=tmp.var_names,
+            columns=[f"ambient_profile_{ft}"],
+        )
+
+    if kneeplot:
+        _, axs = plt.subplots(2, figsize=figsize)
+
+        all_droplets = raw_adata.obs.copy()
+        all_droplets = (
+            all_droplets.sort_values(by="total_counts", ascending=False)
+            .reset_index()
+            .rename_axis("rank_by_counts")
+            .reset_index()
+        )
+        all_droplets = all_droplets.loc[all_droplets["total_counts"] >= min_raw_counts]
+        all_droplets = all_droplets.set_index("index").rename_axis("cells")
+        all_droplets = (
+            all_droplets.sort_values(by="log_prob", ascending=True)
+            .reset_index()
+            .rename_axis("rank_by_log_prob")
+            .reset_index()
+            .set_index("cells")
+        )
+
+        ax = sns.lineplot(
+            data=all_droplets,
+            x="rank_by_counts",
+            y="total_counts",
+            hue="droplets",
+            hue_order=["cells", "other droplets", "cell-free droplets"],
+            palette=sns.color_palette()[-3:],
+            markers=False,
+            lw=2,
+            ci=None,
+            ax=axs[0],
+        )
+
+        ax.set_xscale("log")
+        ax.set_yscale("log")
+        ax.set_xlabel("")
+        ax.set_title("cell-free droplets have lower counts")
+
+        all_droplets["prob"] = np.exp(all_droplets["log_prob"])
+        ax = sns.lineplot(
+            data=all_droplets,
+            x="rank_by_log_prob",
+            y="prob",
+            hue="droplets",
+            hue_order=["cells", "other droplets", "cell-free droplets"],
+            palette=sns.color_palette()[-3:],
+            markers=False,
+            lw=2,
+            ci=None,
+            ax=axs[1],
+        )
+        ax.set_xscale("log")
+        ax.set_xlabel("sorted droplets")
+        ax.set_title("cell-free droplets have relatively higher probs")
+
+        plt.tight_layout()