---
title: "Content of the poster"
author:
  - name: Christophe Vanderaa
    affiliation: Computational Biology, UCLouvain
  - name: Laurent Gatto
    affiliation: Computational Biology, UCLouvain
date: "`r Sys.Date()`"
output:
  html_document
bibliography: ref.bib
---
# Preamble
This document contains the content of the poster for the EuroBioc2019 conference, along with the code that produces the figures. The purpose is not to generate the poster itself but to provide a crude version of its content, without any aesthetic considerations.

For poster tips, check this [website](https://www.animateyour.science/post/how-to-design-an-award-winning-conference-poster).

*Abstract*: Recent advances in sample preparation, processing and mass spectrometry (MS) have allowed the emergence of MS-based single-cell proteomics (SCP). However, bioinformatics tools to process and analyze these new types of data are still missing. In order to boost the development and the benchmarking of SCP methodologies, we are developing the scpdata experiment package. The package will distribute published and curated SCP data sets in standardized Bioconductor format. The poster will give an overview of the available data sets and show preliminary results regarding data exploration and processing.

What follows is the content of the poster.
# **scpdata: a data package for single-cell proteomics**
```{r load, echo=FALSE, results='hide'}
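# Load required packages
suppressPackageStartupMessages(library(cowplot))
suppressPackageStartupMessages(library(dplyr)) # group_by() and summarise() used in the QC chunk
suppressPackageStartupMessages(library(export))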
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(kableExtra))
suppressPackageStartupMessages(library(knitr))
suppressPackageStartupMessages(library(gridExtra))
suppressPackageStartupMessages(library(magrittr))
suppressPackageStartupMessages(library(MSnbase))
suppressPackageStartupMessages(library(scpdata))
suppressPackageStartupMessages(library(tidyr))
suppressWarnings(suppressMessages(source("~/tmp/scp/R/utils-0.0.1.R")))
# This file can be found on GitHub: "cvanderaa/scp/R/utils-0.0.1.R"
```
# Summary
Recent advances in sample preparation, processing and mass spectrometry (MS) have allowed the emergence of MS-based single-cell proteomics (SCP). However, bioinformatics tools to process and analyze these new types of data are still missing. In order to boost the development and the benchmarking of SCP methodologies, we are developing the `scpdata` experiment package. The package will distribute published and curated SCP data sets in standardized `Bioconductor` format.
# Introduction
There are two main types of MS-SCP data:

* Label-free proteomics: the nanoPOTS technology developed by Zhu and colleagues (@Zhu2018-bf) analyzes one sample/cell per MS run. Although the throughput is low (~10 samples/day), it allows accurate peptide quantification.

![](figs/nanoPOTS.png)

* TMT-based proteomics: the SCoPE pipeline developed by the Slavov Lab (@Budnik2018-qh) combines different samples/cells in a single MS run using tandem mass tag (TMT) labeling. This increases the throughput and the identification rate, but reduces the quantification accuracy.

![](figs/scopems.png)

Data will be available for both techniques, so that users can make an informed choice of statistical method for each technique.
# Content of the package
`scpdata` contains SCP data sets formatted as `MSnSet` objects (from the Bioconductor package `MSnbase`). MS data for SCP come at different processing stages:

* Raw data: list of intensities with associated MS(x) and retention time information
* Peptide data: list of identified peptides
* Protein data: a table of protein x sample

Currently available data sets are listed using:
```{r scpdata, eval=FALSE}
scpdata()
```
```{r content, echo=FALSE}
desc <- scpdata()$result[, -c(1, 2), drop = FALSE]
kable(desc) %>%
  kable_styling("striped")
```
Every data set has a help file that describes the content of the data set, how it was processed, and lists useful links for retrieving the original data. For instance, information about the `specht2019_peptide` data set can be retrieved using:
```{r man, eval=FALSE}
?specht2019_peptide
```
# Data manipulation
Thanks to the Bioconductor class `MSnSet`, data sets can easily be processed and analyzed in a standard and systematic way. For example, we re-implemented the R script provided in @Specht2019-jm as object-oriented functions. The pipeline for processing peptide expression data becomes:
```{r data_manip, eval=FALSE, cache=TRUE}
data("specht2019_peptide")
specht2019_peptide %>%
  scp_normalize_stat(what = "row", stats = mean, fun = "-") %>%
  scp_aggregateByProtein() %>%
  scp_normalize_stat(what = "column", stats = median, fun = "-") %>%
  scp_normalize_stat(what = "row", stats = mean, fun = "-") %>%
  imputeKNN(k = 3) %>%
  batchCorrect(batch = "raw.file", target = "celltype") -> scpd
```
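The normalization steps in the pipeline above subtract a summary statistic from every row or every column. The following base R sketch illustrates that idea on a toy matrix. `normalize_stat()` is a hypothetical helper written for this illustration; it is not the `scp_normalize_stat()` implementation, and the assumption that missing values are ignored when computing the statistic is ours.

```r
# Toy log-scale expression matrix: 4 peptides (rows) x 3 cells (columns)
x <- matrix(c(1, 2, 3,
              4, NA, 6,
              7, 8, 9,
              2, 4, 6),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("pep", 1:4), paste0("cell", 1:3)))

# Hypothetical helper (not part of scpdata): subtract a per-row or
# per-column statistic, ignoring missing values (an assumption)
normalize_stat <- function(x, what = c("row", "column"), stats = mean) {
  what <- match.arg(what)
  margin <- if (what == "row") 1 else 2
  centre <- apply(x, margin, stats, na.rm = TRUE)
  sweep(x, margin, centre, FUN = "-")
}

# Centre each peptide on its mean, then each cell on its median
x_row <- normalize_stat(x, what = "row", stats = mean)
x_col <- normalize_stat(x_row, what = "column", stats = median)
```

After the first step every row has mean zero, and after the second step every column has median zero. Missing values stay missing throughout.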
# Data quality control
While developing the SCoPE technology, the Slavov lab also suggested some quality control (QC) measures and visualizations (@Huffman2019-ns). The `scpdata` package provides an ideal working environment for developing and improving SCP data QC. For instance, the TMT channel intensity plot makes it possible to quickly identify failed runs or failed sample preparations:
```{r QC, echo=FALSE, cache=TRUE, fig.width=10}
sc <- specht2019_peptide2
# Keep single cell runs FP94 and FP97
sel <- !grepl("blank|_QC_|col19|col2[0-4]", pData(sc)$run) &
grepl("FP9[47]", pData(sc)$run)
sc <- sc[, sel]
run <- "190222S_LCA9_X_FP94BF" # pData(sc)$run[1]
sc <- sc[, pData(sc)$run == run]
# Format the data
channel <- pData(sc)$channel
channel <- gsub("Reporter[.]intensity[.]", "", channel)
channel <- paste0("TMT", as.numeric(channel) + 1)
df <- data.frame(run = run, channel = channel,
cellType = pData(sc)$cell_type, t(exprs(sc)))
df <- pivot_longer(data = df, cols = -(1:3), values_to = "intensity")
# Get counts per channel
df %>% group_by(channel) %>%
summarise(max = max(intensity, na.rm = TRUE),
n = sum(!is.na(intensity)),
cellType = NA,
mean = mean(intensity, na.rm = TRUE),
median = median(intensity, na.rm = TRUE)) -> counts
# Create the plot
pl <- ggplot(data = df, aes(x = channel, y = intensity)) +
geom_violin(data = df, aes(fill = cellType), na.rm = TRUE) + scale_y_log10() +
geom_point(data = counts, aes(x = channel, y = median, shape = "+"),
color = "red", size = 5) +
geom_text(data = counts, aes(x = channel, y = max*2,
label = paste0("n=", n)),
size = 4, color = "grey50") +
scale_fill_manual(name = "Well type",
values = c(carrier_mix = "grey80", unused = "grey90",
norm = "skyblue3", sc_0 = "wheat",
sc_m0 = "#b3ba82", sc_u = "coral"),
limits = c("carrier_mix", "unused",
"norm", "sc_0", "sc_m0"),
labels = c(carrier_mix = "carrier (100c)",
norm = "reference (5c)",
sc_0 = "empty", unused = "unused",
sc_m0 = "macrophage (1c)")) +
scale_x_discrete(limits = paste0("TMT", 1:11)) +
scale_shape_manual(values = "+",
labels = c(`+` = "Median"),
name = "")
pl
.tmp <- graph2png(x = pl, file = "./figs/QC.png", aspectr = 2)
```
**Figure 1: MS intensity distributions per channel at the peptide level.** Contaminant peptides and peptides with a low identification score were removed. Data taken from run `190222S_LCA9_X_FP94BF` published in @Specht2019-jm. `n` stands for the number of non-missing peptides.
# Data validation
There are two approaches to data validation:

- Generate in vitro or in silico simulated data sets: the ground truth is known but could deviate from reality
- Generate data from biological samples with some prior knowledge (manual sampling, FACS staining, experimental design, ...): close to a real experiment but no ground truth is available

Dimension reduction and data visualization (PCA, t-SNE, UMAP) are often used for validating data (Figure 2). When ground truth is available, we can further use benchmarking metrics: silhouette width, entropy of cluster accuracy, kBET for batch correction.
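As an illustration of one of these benchmarking metrics, the sketch below computes the mean silhouette width in base R on simulated data with known labels. `silhouette_width()` is a hypothetical helper written for this example, and the simulated two-group data are an assumption, not taken from any `scpdata` data set.

```r
# Hypothetical helper: mean silhouette width of samples (rows of x)
# given known group labels, using Euclidean distances
silhouette_width <- function(x, labels) {
  d <- as.matrix(dist(x))
  labels <- as.character(labels)
  groups <- unique(labels)
  s <- vapply(seq_len(nrow(x)), function(i) {
    same <- labels == labels[i]
    same[i] <- FALSE
    if (!any(same)) return(0) # singleton group: silhouette set to 0
    # a: mean distance to the other members of the sample's own group
    a <- mean(d[i, same])
    # b: smallest mean distance to the members of any other group
    b <- min(vapply(setdiff(groups, labels[i]), function(g) {
      mean(d[i, labels == g])
    }, numeric(1)))
    (b - a) / max(a, b)
  }, numeric(1))
  mean(s)
}

set.seed(1)
# Simulated ground truth: two well-separated groups of 20 cells each
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 5), ncol = 2))
truth <- rep(c("monocyte", "macrophage"), each = 20)

sw_true <- silhouette_width(x, truth)           # labels match the structure
sw_random <- silhouette_width(x, sample(truth)) # labels shuffled at random
```

A width close to 1 indicates that cells sit much closer to their own group than to the other one, whereas shuffled labels are expected to give a width near 0.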
```{r PCA, echo=FALSE, cache=TRUE, warning=FALSE, message=FALSE, fig.width=10}
scpd <- specht2019_peptide
pca <- nipals(exprs(scpd), ncomp = 3, center = TRUE, scale = TRUE)
pl1 <- customPCA(scpd, pca, x = "PC2", y = "PC1", color = "celltype",
shape = "batch_chromatography") +
theme(legend.position = "none")
pl2 <- customPCA(scpd, pca, x = "PC3", y = "PC1", color = "celltype",
shape = "batch_chromatography") +
scale_color_manual(name = "Cell type",
values = c(sc_m0 = "skyblue3",
sc_u = "coral"),
labels = c(sc_m0 = "macrophages",
sc_u = "monocytes")) +
scale_shape(name = "Batch")
# Save plot
suppressMessages(
graph2png(x = plot(grid.arrange(pl1, pl2, ncol = 2, widths=c(3, 4))),
file = "./figs/PCA.png", aspectr = 2)
)
```
**Figure 2: PCA plot of peptide expression data**. Macrophages and monocytes are well separated in the third principal component. However, the first and second components are driven by batch effects. Monocytes are untreated U-937 cells, macrophages are U-937 cells treated with PMA for 24 hours. LCA10 and LCA9 are two chromatographic batches. The PCA was performed using the non-linear iterative partial least squares (NIPALS) algorithm, which is robust against missing data.
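To make the NIPALS idea concrete, here is a minimal base R sketch that extracts principal components while skipping missing cells in each regression step. `nipals_pca()` is a hypothetical helper written for this illustration and is not the `nipals()` function called in the chunk above; unlike that call, it only centres the columns and does not scale them. The simulated data are likewise an assumption.

```r
# Hypothetical helper: NIPALS PCA that tolerates NAs by using only the
# observed cells in each regression step
nipals_pca <- function(x, ncomp = 2, tol = 1e-9, maxit = 500) {
  x <- scale(x, center = TRUE, scale = FALSE) # column means ignore NAs
  obs <- ifelse(is.na(x), 0, 1)               # 1 where a value is observed
  x0 <- x
  x0[is.na(x0)] <- 0                          # zeros add nothing to the sums
  scores <- matrix(NA_real_, nrow(x), ncomp)
  loadings <- matrix(NA_real_, ncol(x), ncomp)
  for (k in seq_len(ncomp)) {
    t_k <- x0[, which.max(colSums(x0^2))]     # start from the largest column
    for (it in seq_len(maxit)) {
      # Loadings: regress each column on the scores (observed cells only)
      p_k <- crossprod(x0, t_k) / crossprod(obs, t_k^2)
      p_k <- p_k / sqrt(sum(p_k^2))
      # Scores: regress each row on the loadings (observed cells only)
      t_new <- (x0 %*% p_k) / (obs %*% p_k^2)
      converged <- sum((t_new - t_k)^2) < tol * sum(t_new^2)
      t_k <- as.numeric(t_new)
      if (converged) break
    }
    scores[, k] <- t_k
    loadings[, k] <- as.numeric(p_k)
    # Deflate: remove the fitted component from the observed cells
    x0 <- x0 - obs * tcrossprod(t_k, as.numeric(p_k))
  }
  list(scores = scores, loadings = loadings)
}

set.seed(42)
n <- 30
latent <- rnorm(n)
# Four variables driven by one latent factor plus a little noise
x <- cbind(latent, 2 * latent, -latent, 0.5 * latent) +
  matrix(rnorm(n * 4, sd = 0.1), n)
# Remove one value in each of the first 12 rows
x_na <- x
x_na[cbind(1:12, rep(1:4, 3))] <- NA

fit <- nipals_pca(x_na, ncomp = 2)
ref <- prcomp(x)$x[, 1] # PC1 scores from the complete data
agreement <- abs(cor(fit$scores[, 1], ref))
```

Here `agreement` measures how closely the first component recovered from the incomplete matrix tracks the first component of the complete data. The sketch does not guard against rows or columns that are entirely missing.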
# Problems to tackle
## Batch effects
Batch effects are inherent to MS-SCP data since many samples/cells cannot be analyzed in one go (even with TMT labeling). Different samples have to be distributed across different MS runs and this leads to tremendous batch effects (Figure 2). These batch effects should be corrected before performing downstream analysis to avoid biased conclusions.
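The sketch below illustrates the problem and the simplest possible correction on simulated data: each MS run receives its own offset, and the correction subtracts the per-run mean of every feature. This is plain batch mean-centring, not the `batchCorrect()` function used earlier, and `center_by_batch()` is a hypothetical helper. Note that mean-centring also removes any biological signal that is confounded with the run, so a design-aware method is preferable when cell types are unevenly distributed across runs.

```r
set.seed(3)
# Toy data: 50 proteins x 12 cells from 3 MS runs, each run shifted by an offset
batch <- rep(c("run1", "run2", "run3"), each = 4)
offset <- c(run1 = 0, run2 = 2, run3 = -1)
x <- matrix(rnorm(50 * 12), nrow = 50) +
  matrix(offset[batch], nrow = 50, ncol = 12, byrow = TRUE)

# Hypothetical helper: for each feature, subtract its mean within each batch
center_by_batch <- function(x, batch) {
  for (b in unique(batch)) {
    cols <- batch == b
    x[, cols] <- x[, cols, drop = FALSE] -
      rowMeans(x[, cols, drop = FALSE], na.rm = TRUE)
  }
  x
}
x_corr <- center_by_batch(x, batch)

# Average intensity per run, before and after correction
before <- tapply(colMeans(x), batch, mean)
after <- tapply(colMeans(x_corr), batch, mean)
```

Comparing `before` and `after` shows the run offsets disappearing once each feature is centred within its run.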
## Missingness
The @Specht2019-jm data set contains approximately 75 \% missing data. This needs to be accounted for by the statistical methods used for downstream analyses. Furthermore, conditions can be differentially affected by missingness, which could lead to imputation-induced differential expression (Figure 3).
```{r missingness, echo=FALSE, fig.asp = 1}
sc <- specht2019_peptide
# Format the data
df <- do.call(cbind, lapply(unique(pData(sc)$celltype), function(x){
.sub <- exprs(sc)[, pData(sc)$celltype == x]
mis <- rowSums(is.na(.sub))/ncol(.sub)*100
logFC <- apply(.sub, 1, median, na.rm = TRUE)
out <- data.frame(mis, logFC)
colnames(out) <- paste0(c("mis", "logFc"), "_", x)
return(out)
}))
df$relFC <- df$logFc_sc_m0 - df$logFc_sc_u
df$relFC[df$relFC > 2] <- 2
df$relFC[df$relFC < -2] <- -2
# Scatter plot
sp <- ggplot(data = df, aes(x = mis_sc_m0, y = mis_sc_u, col = relFC)) +
  # `size` is a point property, so it is set in geom_point(), not in ggplot()
  geom_point(size = 0.8) +
scale_color_gradient2(name = "log2(rFC)", low = "darkgreen", breaks = -2:2,
labels = c("Monocyte", -1:1, "Macrophage"),
high = "red3", midpoint = 0, mid = "wheat") +
theme(legend.position = c(1, 0), legend.justification = c("right", "bottom"),
plot.margin = unit(c(0, 0, 0, 0), "cm"),
legend.background = element_rect(fill="transparent",
size=0.5, linetype="solid")) +
ylab("Missingness (%) in monocytes") + xlab("Missingness (%) in macrophages")
# Macrophage density plot
dp1 <- ggplot(data = df, aes(mis_sc_m0)) +
geom_density(fill = "grey") +
theme(axis.text = element_blank(), axis.title = element_blank(),
axis.ticks = element_blank(),
plot.margin = unit(c(0, 0, 0, 1), "cm"),
panel.background = element_blank())
# monocyte density plot
dp2 <- ggplot(data = df, aes(mis_sc_u)) +
geom_density(fill = "grey") +
coord_flip() +
theme(axis.text = element_blank(), axis.title = element_blank(),
axis.ticks = element_blank(),
plot.margin = unit(c(0, 0, 1, 0), "cm"),
panel.background = element_blank())
# Empty plot
blank <- ggplot() + geom_blank() + theme(panel.background = element_blank())
# Combine and save plots
pl <- plot_grid(dp1, blank, sp, dp2, align = "none",
rel_widths=c(10, 1), rel_heights=c(1, 10))
plot(pl)
suppressMessages(graph2png(x = pl, file = "./figs/missing.png", aspectr = 1))
```
**Figure 3: missingness in macrophages and monocytes**. Distribution of the proportion of missing data in monocytes against macrophages. Color indicates the log2 fold change of macrophages (red) over monocytes (green).
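The following base R sketch shows how imputation-induced differential expression can arise. It simulates two cell types with no true difference, applies a higher missingness rate to one of them, and replaces missing values by a fixed low value. The missingness rates, the naive imputation rule and the helpers `missingness()` and `group_diff()` are all assumptions made for this illustration.

```r
set.seed(7)
# Toy log2 intensities: 200 peptides x 20 cells, no true group difference
n_pep <- 200
celltype <- rep(c("sc_m0", "sc_u"), each = 10)
x <- matrix(rnorm(n_pep * 20, mean = 10), nrow = n_pep)

# Assumption: values go missing more often in monocytes ("sc_u")
p_miss <- ifelse(celltype == "sc_u", 0.8, 0.4)
for (j in seq_along(celltype)) {
  x[runif(n_pep) < p_miss[j], j] <- NA
}

# Missingness (%) per peptide within a group of cells
missingness <- function(x, keep) {
  rowMeans(is.na(x[, keep, drop = FALSE])) * 100
}
mis_m0 <- missingness(x, celltype == "sc_m0")
mis_u <- missingness(x, celltype == "sc_u")

# Naive imputation: replace every NA by the smallest observed value
x_imp <- x
x_imp[is.na(x_imp)] <- min(x, na.rm = TRUE)

# Difference in mean intensity, macrophages minus monocytes, per peptide
group_diff <- function(x) {
  rowMeans(x[, celltype == "sc_m0"], na.rm = TRUE) -
    rowMeans(x[, celltype == "sc_u"], na.rm = TRUE)
}
diff_observed <- mean(group_diff(x), na.rm = TRUE) # observed values only
diff_imputed <- mean(group_diff(x_imp))            # after naive imputation
```

With observed values only, the average difference is expected to stay near zero. After imputation it shifts towards the group with fewer missing values, although no biological difference was simulated.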
## Curse of dimensionality
Current experiments produce relatively small data sets (thousands of peptides x hundreds of cells) compared to single-cell transcriptomics. Nevertheless, the rapid growth of MS-SCP techniques will lead to data sets of much higher dimensionality. This is a challenge for both the statistical analyses and the software optimization. Possible solutions should draw on current achievements in single-cell transcriptomics.
# Conclusion
MS-based SCP is still in its infancy. However, we hope that developing an MS-SCP data package will provide a strong framework for bioinformatics research. It will support the development of new packages for tackling the statistical issues (missing data, batch effects, high dimensionality) seen in MS-SCP data, and provide a growing data repository for software benchmarking. Through thorough implementation and benchmarking of its software, MS-SCP might become a new state-of-the-art technique for single-cell omics.
# Acknowledgements
Research fellowship from the FRS-FNRS
# References