Poster-content.Rmd

--- 
title: "Content of the poster"
author:
- name: Christophe Vanderaa
  affiliation: Computational Biology, UCLouvain
- name: Laurent Gatto
  affiliation: Computational Biology, UCLouvain
date: "`r Sys.Date()`"
output:
  html_document
bibliography: ref.bib
---

# Preamble 

This document contains the content of the poster for the EuroBioc2019 conference. It also includes the code to produce the figures. The purpose is not to generate the poster itself but to provide a crude version of its content, without any aesthetic considerations.

For poster tips, check this [website](https://www.animateyour.science/post/how-to-design-an-award-winning-conference-poster).

*Abstract*: Recent advances in sample preparation, processing and mass spectrometry (MS) have allowed the emergence of MS-based single-cell proteomics (SCP). However, bioinformatics tools to process and analyze these new types of data are still missing. In order to boost the development and the benchmarking of SCP methodologies, we are developing the scpdata experiment package. The package will distribute published and curated SCP data sets in standardized Bioconductor format. The poster will give an overview of the available data sets and show preliminary results regarding data exploration and processing.

What follows is the content of the poster. 

# **scpdata: a data package for single-cell proteomics**

```{r load, echo=FALSE, results='hide'}
# Load required data
suppressPackageStartupMessages(library(cowplot))
suppressPackageStartupMessages(library(export))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(kableExtra))
suppressPackageStartupMessages(library(knitr))
suppressPackageStartupMessages(library(gridExtra))
suppressPackageStartupMessages(library(magrittr))
suppressPackageStartupMessages(library(MSnbase))
suppressPackageStartupMessages(library(scpdata))
suppressPackageStartupMessages(library(tidyr))
suppressWarnings(suppressMessages(source("~/tmp/scp/R/utils-0.0.1.R")))
# This file can be found on Github "cvanderaa/scp/R/utils-0.0.1.R"
```

# Summary 

Recent advances in sample preparation, processing and mass spectrometry (MS) have allowed the emergence of MS-based single-cell proteomics (SCP). However, bioinformatics tools to process and analyze these new types of data are still missing. In order to boost the development and the benchmarking of SCP methodologies, we are developing the `scpdata` experiment package. The package will distribute published and curated SCP data sets in standardized `Bioconductor` format. 

# Introduction

There are two main types of MS-SCP data:

* Label-free proteomics: the nanoPOTS technology developed by Zhu and colleagues (@Zhu2018-bf) analyzes one sample/cell per MS run. Although the throughput is low (~10 samples/day), it allows for an accurate peptide quantification. 

![](figs/nanoPOTS.png)

* TMT-based proteomics: the SCoPE pipeline developed by the Slavov Lab (@Budnik2018-qh) combines different samples/cells in a single MS run using tandem-mass tags (TMT) labeling. It increases the throughput and the identification rate, but reduces the quantification accuracy.

![](figs/scopems.png)

Data will be available for both techniques. This allows an informed choice of statistical method for each technique.


# Content of the package 

`scpdata` contains SCP data sets formatted as `MSnSet`s (from the Bioconductor package `MSnbase`). There are different processing stages of MS data for SCP:  

* Raw data: list of intensities with associated MS(x) and retention time information
* Peptide data: list of identified peptides
* Protein data: a table of protein x sample

Currently available data sets are listed using:

```{r scpdata, eval=FALSE}
scpdata()
```
```{r content, echo=FALSE}
desc <- scpdata()$result[, -c(1, 2), drop = FALSE]
kable(desc) %>%
  kable_styling("striped")
```

Every data set has a help file that describes the content of the data set, how it was processed, and lists useful links for retrieving the original data. For instance, information about the `specht2019_peptide` data set can be retrieved using:

```{r man, eval=FALSE}
?specht2019_peptide
```


# Data manipulation

Thanks to the Bioconductor class `MSnSet`, data sets can easily be processed and analyzed in a standard and systematic way. For example, we re-implemented the R script provided in @Specht2019-jm as object-oriented functions. The pipeline for processing peptide expression data becomes:

```{r data_manip, eval=FALSE, cache=TRUE}
data("specht2019_peptide")
specht2019_peptide %>% 
  scp_normalize_stat(what = "row", stats = mean, fun = "-") %>%
  scp_aggregateByProtein() %>%
  scp_normalize_stat(what = "column", stats = median, fun = "-") %>%
  scp_normalize_stat(what = "row", stats = mean, fun = "-") %>%
  imputeKNN(k = 3) %>%
  batchCorrect(batch = "raw.file", target = "celltype") -> scpd
```
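
The normalization steps of this pipeline can be illustrated on a toy matrix with base R. This is only a sketch: it assumes that `scp_normalize_stat()` subtracts a per-row or per-column statistic from log-transformed intensities, which is how its arguments read but is not confirmed here. The helper `normalize_stat()` and the toy data are hypothetical.

```r
# Hypothetical helper (not part of scpdata): subtract a summary statistic
# from every row or column of a log-intensity matrix, ignoring missing values.
normalize_stat <- function(x, what = c("row", "column"), stats = mean) {
  what <- match.arg(what)
  margin <- if (what == "row") 1 else 2
  centre <- apply(x, margin, stats, na.rm = TRUE)
  sweep(x, margin, centre, FUN = "-")
}

# Toy data: 3 peptides (rows) x 4 cells (columns) with one missing value
x <- matrix(c(10, 12, NA, 14,
               8,  9, 10, 11,
               5,  7,  6,  8),
            nrow = 3, byrow = TRUE,
            dimnames = list(paste0("pep", 1:3), paste0("cell", 1:4)))

# Center each peptide on its mean, then each cell on its median
x_row <- normalize_stat(x, what = "row", stats = mean)
x_col <- normalize_stat(x_row, what = "column", stats = median)
```

Missing values stay missing after both steps, so imputation still has to be handled separately.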

# Data quality control

While developing the SCoPE technology, the Slavov lab also suggested some quality control (QC) measures and visualizations (@Huffman2019-ns). The `scpdata` package provides an ideal working environment for developing and improving SCP data QC. For instance, the TMT channel intensity plot makes it possible to quickly identify failed runs or failed sample preparations: 

```{r QC, echo=FALSE, cache=TRUE, fig.width=10}

sc <- specht2019_peptide2
# Keep single cell runs FP94 and FP97
sel <- !grepl("blank|_QC_|col19|col2[0-4]", pData(sc)$run) &
  grepl("FP9[47]", pData(sc)$run)
sc <- sc[, sel]
run <- "190222S_LCA9_X_FP94BF" # pData(sc)$run[1]
sc <- sc[, pData(sc)$run == run]
# Format the data
channel <- pData(sc)$channel
channel <- gsub("Reporter[.]intensity[.]", "", channel)
channel <- paste0("TMT", as.numeric(channel) + 1)
df <- data.frame(run = run, channel = channel, 
                 cellType = pData(sc)$cell_type, t(exprs(sc)))
df <- pivot_longer(data = df, cols = -(1:3), values_to = "intensity") 
# Get counts per channel
# (dplyr is not attached in the load chunk, so its verbs are namespaced)
df %>% dplyr::group_by(channel) %>% 
  dplyr::summarise(max = max(intensity, na.rm = TRUE), 
            n = sum(!is.na(intensity)),
            cellType = NA,
            mean = mean(intensity, na.rm = TRUE),
            median = median(intensity, na.rm = TRUE)) -> counts
# Create the plot
pl <- ggplot(data = df, aes(x = channel, y = intensity)) +
  geom_violin(data = df, aes(fill = cellType), na.rm = TRUE) + scale_y_log10() + 
  geom_point(data = counts, aes(x = channel, y = median, shape = "+"), 
             color = "red", size = 5) + 
  geom_text(data = counts, aes(x = channel, y = max*2, 
                               label = paste0("n=", n)),
            size = 4, color = "grey50") + 
  scale_fill_manual(name = "Well type",
                    values = c(carrier_mix = "grey80", unused = "grey90",
                               norm = "skyblue3", sc_0 = "wheat", 
                               sc_m0 = "#b3ba82", sc_u = "coral"), 
                    limits = c("carrier_mix", "unused", 
                               "norm", "sc_0", "sc_m0"),
                    labels = c(carrier_mix = "carrier (100c)", 
                               norm = "reference (5c)",
                               sc_0 = "empty", unused = "unused", 
                               sc_m0 = "macrophage (1c)")) +
  scale_x_discrete(limits = paste0("TMT", 1:11)) +
  scale_shape_manual(values = "+",
                     labels = c(`+` = "Median"),
                     name = "")
pl
.tmp <- graph2png(x = pl, file = "./figs/QC.png", aspectr = 2)

```

**Figure 1: MS intensity distributions per channel at the peptide level.** Contaminant peptides and peptides with a low identification score were removed. Data taken from run `190222S_LCA9_X_FP94BF` published in @Specht2019-jm. `n` stands for the number of non-missing peptides.


# Data validation

There are two approaches to data validation:

- Generate simulated data sets, in vitro or in silico: the ground truth is known but could deviate from reality
- Generate data from biological samples with some prior knowledge (manual sampling, FACS staining, experimental design,...): close to a real experiment but no ground truth available

Dimension reduction and data visualization (PCA, t-SNE, UMAP) are often used for validating data (Figure 2). When ground truth is available, we can further use benchmarking metrics: silhouette width, entropy of cluster accuracy, kBET for batch correction. 
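
As an illustration of one of these metrics, the average silhouette width can be computed with `silhouette()` from the `cluster` package. The sketch below uses simulated data with two hypothetical cell types; the group labels, sizes and effect size are assumptions made for the example, not values from the package data sets.

```r
library(cluster)  # provides silhouette()

set.seed(1)
# Simulated data: 2 hypothetical cell types, 20 cells each, 5 features.
# Rows are cells here; an expression matrix with features in rows
# would need to be transposed first.
celltype <- rep(c("macrophage", "monocyte"), each = 20)
x <- rbind(matrix(rnorm(20 * 5, mean = 0), ncol = 5),
           matrix(rnorm(20 * 5, mean = 3), ncol = 5))

# silhouette() expects integer cluster codes and a dissimilarity object
sil <- silhouette(as.integer(factor(celltype)), dist(x))
avg_sil <- mean(sil[, "sil_width"])
```

Values of `avg_sil` close to 1 indicate that cells lie closer to their own group than to the other one, while values near 0 indicate overlapping groups.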


```{r PCA, echo=FALSE, cache=TRUE, warning=FALSE, message=FALSE, fig.width=10}
scpd <- specht2019_peptide
pca <- nipals(exprs(scpd), ncomp = 3, center = TRUE, scale = TRUE)
pl1 <- customPCA(scpd, pca, x = "PC2", y = "PC1", color = "celltype", 
          shape = "batch_chromatography") + 
  theme(legend.position = "none")
pl2 <- customPCA(scpd, pca, x = "PC3", y = "PC1", color = "celltype", 
          shape = "batch_chromatography") + 
  scale_color_manual(name = "Cell type", 
                     values = c(sc_m0 = "skyblue3",
                                sc_u = "coral"),
                     labels = c(sc_m0 = "macrophages",
                                sc_u = "monocytes")) + 
  scale_shape(name = "Batch")
# Save plot
suppressMessages(
  graph2png(x = plot(grid.arrange(pl1, pl2, ncol = 2, widths=c(3, 4))), 
            file = "./figs/PCA.png", aspectr = 2)
)
```
 
 **Figure 2: PCA plot of peptide expression data**. Macrophages and monocytes are well separated along the third principal component. However, the first and second components are driven by batch effects. Monocytes are untreated U-937 cells; macrophages are U-937 cells treated with PMA for 24 hours. LCA10 and LCA9 are two chromatographic batches. The PCA was performed using the non-linear iterative partial least squares (NIPALS) algorithm, which is robust to missing data.
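
To show why NIPALS tolerates missing values, here is a minimal base R sketch that extracts the first component only. It is not the implementation used for Figure 2: the function `nipals_pc1()` and the toy data are hypothetical, and the sketch assumes that every row and every column has at least one observed value.

```r
# Hypothetical one-component NIPALS: scores and loadings are obtained by
# alternating regressions that only use the observed entries of x.
nipals_pc1 <- function(x, tol = 1e-10, max_iter = 500) {
  x <- scale(x, center = TRUE, scale = FALSE)  # column means ignore NA
  obs <- !is.na(x)
  w <- obs * 1          # numeric indicator of observed entries
  x0 <- x
  x0[!obs] <- 0         # missing entries contribute nothing to the sums
  # Start from the column with the largest sum of squares
  scores <- x0[, which.max(colSums(x0^2))]
  for (i in seq_len(max_iter)) {
    # Loadings: regress each column on the scores, observed rows only
    loadings <- colSums(x0 * scores) / colSums(w * scores^2)
    loadings <- loadings / sqrt(sum(loadings^2))
    # Scores: regress each row on the loadings, observed columns only
    new_scores <- as.vector(x0 %*% loadings) / as.vector(w %*% loadings^2)
    converged <- sum((new_scores - scores)^2) < tol * sum(new_scores^2)
    scores <- new_scores
    if (converged) break
  }
  list(scores = scores, loadings = loadings)
}

# Toy rank-1 data with a few missing values
a <- c(-3, -2, -1, 0, 1, 2, 3)   # hypothetical sample scores
b <- c(1, 2, -1, 0.5)            # hypothetical feature loadings
x <- outer(a, b)
x[4, 1] <- NA
x[c(1, 7), 2] <- NA
x[c(2, 6), 3] <- NA

pc1 <- nipals_pc1(x)
```

Because each regression skips the missing entries instead of imputing them, the recovered scores follow `a` and the loadings follow `b` despite the gaps.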
 
# Problems to tackle

## Batch effects

Batch effects are inherent to MS-SCP data since many samples/cells cannot be analyzed in one go (even with TMT labeling). Different samples have to be distributed across different MS runs and this leads to tremendous batch effects (Figure 2). These batch effects should be corrected before performing downstream analysis to avoid biased conclusions.

## Missingness

The @Specht2019-jm data sets contain about 75 \% missing data. This needs to be accounted for by the statistical methods used for downstream analyses. Furthermore, conditions can be differentially affected by missingness, which could lead to imputation-induced differential expression (Figure 3).

```{r missingness, echo=FALSE, fig.asp = 1}
sc <- specht2019_peptide
# Format the data
df <- do.call(cbind, lapply(unique(pData(sc)$celltype), function(x){
  .sub <- exprs(sc)[, pData(sc)$celltype == x]
  mis <- rowSums(is.na(.sub))/ncol(.sub)*100
  logFC <- apply(.sub, 1, median, na.rm = TRUE)
  out <- data.frame(mis, logFC)
  colnames(out) <- paste0(c("mis", "logFc"), "_", x)
  return(out)
}))
df$relFC <- df$logFc_sc_m0 - df$logFc_sc_u
df$relFC[df$relFC > 2] <- 2
df$relFC[df$relFC < -2] <- -2
# Scatter plot
sp <- ggplot(data = df, aes(x = mis_sc_m0, y = mis_sc_u, col = relFC)) +
  geom_point(size = 0.8) + 
  scale_color_gradient2(name = "log2(rFC)", low = "darkgreen", breaks = -2:2, 
                        labels = c("Monocyte", -1:1, "Macrophage"),
                        high = "red3", midpoint = 0, mid = "wheat") +
  theme(legend.position = c(1, 0), legend.justification = c("right", "bottom"),
        plot.margin = unit(c(0, 0, 0, 0), "cm"),
        legend.background = element_rect(fill="transparent", 
                                         size=0.5, linetype="solid")) +
  ylab("Missingness (%) in monocytes") + xlab("Missingness (%) in macrophages")
# Macrophage density plot
dp1 <- ggplot(data = df, aes(mis_sc_m0)) + 
  geom_density(fill = "grey") +
  theme(axis.text = element_blank(), axis.title = element_blank(),
        axis.ticks = element_blank(),
        plot.margin = unit(c(0, 0, 0, 1), "cm"),
        panel.background = element_blank())
# monocyte density plot
dp2 <- ggplot(data = df, aes(mis_sc_u)) + 
  geom_density(fill = "grey") +
  coord_flip() +
  theme(axis.text = element_blank(), axis.title = element_blank(),
        axis.ticks = element_blank(),
        plot.margin = unit(c(0, 0, 1, 0), "cm"),
        panel.background = element_blank())
# Empty plot
blank <- ggplot() + geom_blank() + theme(panel.background = element_blank())
# Combine and save plots
pl <- plot_grid(dp1, blank, sp, dp2, align = "none",
                rel_widths=c(10, 1), rel_heights=c(1, 10))
plot(pl)
suppressMessages(graph2png(x = pl, file = "./figs/missing.png", aspectr = 1))
```

**Figure 3: missingness in macrophages and monocytes**. Distribution of the proportion of missing data in monocytes against macrophages. Color indicates the log2 fold change of macrophages (red) over monocytes (green). 
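
The missingness percentages shown in Figure 3 come down to counting `NA` values per peptide within each cell type. The following sketch reproduces that computation on a toy matrix; the data and the group assignment are made up for illustration, only the labels `sc_m0` and `sc_u` are borrowed from the figure code.

```r
# Toy log-intensity matrix: 4 peptides (rows) x 6 cells (columns)
x <- matrix(c( 1, NA,  2, NA, NA, NA,
              NA, NA,  3,  4,  5, NA,
               2,  3,  4,  5,  6,  7,
              NA, NA, NA,  1, NA,  2),
            nrow = 4, byrow = TRUE)
celltype <- rep(c("sc_m0", "sc_u"), each = 3)

# Overall percentage of missing values
overall <- mean(is.na(x)) * 100

# Percentage of missing values per peptide within each cell type
mis <- sapply(split(seq_len(ncol(x)), celltype), function(cols) {
  rowMeans(is.na(x[, cols, drop = FALSE])) * 100
})
```

Peptides whose rows of `mis` differ strongly between the two columns are the ones at risk of imputation-induced differential expression.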

## Curse of dimensionality

The current experiments produce relatively small data sets (thousands of peptides x hundreds of cells) compared to single-cell transcriptomics. Nevertheless, the rapid development of MS-SCP techniques will lead to data sets of much higher dimensionality. This is a challenge for both the statistical analyses and the software optimization. Possible solutions should draw on current achievements in single-cell transcriptomics. 


# Conclusion

MS-based SCP is still in its infancy. However, we hope that developing an MS-SCP data package will provide a strong framework for bioinformatics research. It will support the development of new packages for tackling the statistical issues (missing data, batch effects, high dimensionality) seen in MS-SCP data, as well as provide a growing data repository for software benchmarking. Through thorough implementation and benchmarking of its software, MS-SCP might become a new state-of-the-art technique for single-cell omics.

# Acknowledgements

Research fellowship from the FRS-FNRS 

# References