README.Rmd

---
output: github_document
---

# Manual for the use of the RCM functions

## Publication

The underlying method of the RCM package is described in detail in the following article: ["A unified framework for unconstrained and constrained ordination of microbiome read count data"](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0205474).

## Install and load packages

This repo contains R-code to fit and plot the RC(M)-models augmented with the negative binomial. The basic usage is demonstrated here, for more advanced examples see the [RCM vignette](https://bioconductor.org/packages/release/bioc/vignettes/RCM/inst/doc/RCMvignette.html) and the help pages of _RCM()_, *RCM_NB()*, and _plot.RCM()_ functions.

The package can be installed from BioConductor and loaded using the following commands:

```{r installRCMpackageBioConductor, eval = FALSE}
library(BiocManager)
install("RCM")
```

```{r loadRCMpackage}
suppressPackageStartupMessages(library(RCM))
cat("RCM package version", as.character(packageVersion("RCM")), "\n")
```

Alternatively, the latest version can be installed directly from this GitHub repo as follows:

```{r installRCMpackage, eval = FALSE}
library(devtools)
install_github("CenterForStatistics-UGent/RCM")
```


```{r check-install-load-packages, warning=FALSE, message=FALSE, echo=FALSE, purl=TRUE}
knitr::opts_chunk$set(cache = TRUE, autodep = TRUE,  warning=FALSE, message=FALSE, echo=TRUE, eval=TRUE, tidy = TRUE, fig.width = 9, fig.height = 6, purl=TRUE, fig.show = "hold", cache.lazy = FALSE, fig.path = "README_figs/README-")
knitr::opts_knit$set(root.dir = "/home/sthaw/PostDoc/Ordination/RCM")
palStore = palette()
```

## Dataset

As example data we use a study on the microbiome of colorectal cancer patients "Potential of fecal microbiota for early-stage detection of colorectal cancer" (2014) by Zeller _et al._.

```{r loadZellerData}
library(phyloseq)
data(Zeller)
```

The _Zeller_ object is a phyloseq object, which contains all the information of a microbiome experiment. The _RCM_ package is tailor-made for phyloseq objects. We therefore strongly recommend building a phyloseq object out of your data, and feeding that object into the _RCM()_ function. More information on building the phyloseq object can be found [here](http://joey711.github.io/phyloseq/import-data.html).

## Unconstrained RCM

### Fitting the unconstrained RCM model

The unconstrained RC(M) method represents all variability present in the data, regardless of covariate information. It should be used as a first step in an exploratory analysis. The _RCM_ function is the front end function for this purpose, behind it is the *RCM_NB* function that does the hard work but requires numeric matrix imputs. We apply it to the Zeller data.

```{r fitUnconstrainedRCM, eval = FALSE}
ZellerRCM2 = RCM(Zeller)
```

```{r fitHidden, echo = FALSE}
if(!file.exists(file = "./inst/fits/zellerFits.RData")){
ZellerRCM2 = RCM(Zeller)
ZellerRCM3 = RCM(Zeller, k = 3, round = TRUE)
ZellerRCM2cond = RCM(Zeller, confounders = "Country")
ZellerRCM2constr = RCM(Zeller, k = 2, round = TRUE, covariates = c("Age","Gender","BMI","Country", "Diagnosis"), responseFun = "linear")
ZellerRCM2constrNonParam = RCM(Zeller, round = TRUE, k = 2, covariates = c("Age","Gender","BMI","Country", "Diagnosis"), responseFun = "nonparametric")
save(ZellerRCM2, ZellerRCM3, ZellerRCM2cond, ZellerRCM2constr, ZellerRCM2constrNonParam, file = "./inst/fits/zellerFits.RData")
} else {load(file = "./inst/fits/zellerFits.RData")}
```

which took `r round(ZellerRCM2$runtime,1)` minutes.

### Plotting the unconstrained RCM

A simple plot-command will yield a plot of the RCM ordination

```{r plotUnconstrainedRCMall}
plot(ZellerRCM2)
```

Samples are represented by dots, taxa by arrows. Both represent vectors with the origin as starting point.

Valid interpretatations are the following:

 - Samples (endpoints of sample vectors, the red dots) close together have a similar taxon composition
 - Taxa are more abundant than average in samples to which their arrows point, and less abundant when their arrows point away from the samples. For example Fusobacterium mortiferum is more abundant than average in samples on the left side of the plot, and more abundant in samples on the right side.
 - The importance parameters $\psi$ shown for every axis reflect the relative importance of the dimensions
 
Distances between endpoints of taxon vectors are meaningless.

The plot can be made more interpretable by adding some colour, e.g. by colouring the samples by cancer diagnosis. For this one can simply provide the name of the sample variable as present in the phyloseq object. For more plotting options see _?plot.RCM_.

```{r plotUnconstrainedRCMallColour}
plot(ZellerRCM2, samColour = "Diagnosis")
```

Any richness measure defined in the _phyloseq_ package (see _?estimateRichness_) can also be supplied as sample colour.

```{r plotRichness}
plot(ZellerRCM2, samColour = "Shannon")
```

#### Conditioning

In order to condition on certain variables, supply their names to the _confounders_ argument. Here we condition on the _country_ variable

```{r condition, eval = FALSE}
ZellerRCM2cond = RCM(Zeller, confounders = "Country")
```

Conditioning can be applied for unconstrained as well as constrained RCM, the ensuing plots have exactly the same interpretation as before, except that all variability attributable to the confounding variables has been removed.

```{r plotCond}
plot(ZellerRCM2cond)
```

### PERMANOVA

PERMANOVA is an analysis that looks for significant clustering of samples in the ordination plot according to predefined sample groups, i.e. not for clusters identified _based on_ the ordination. The test is based on a pseudo F-statistic. The function accepts variables inside the phyloseq object as groups, or user supplied grouping factors.

```{r permanova}
permanovaZeller = permanova(ZellerRCM2, "Diagnosis")
permanovaZeller
```

This PERMANOVA analysis was not run in the original paper, but is a later addition.

## Constrained RCM

In this second step we look for the variability in the dataset explained sample-specific variables. This should be done preferably only with variables that are believed to have an impact on the species' abundances. Here we used the variables age, gender, BMI, country and diagnosis in the gradient.
In this analysis all covariates values of a sample are projected onto a single scalar, the _environmental score_ of this sample. The projection vector is called the _environmental gradient_, the magnitude of its components reveals the importance of each variable. The taxon-wise response functions then describe how the logged mean abundance depends on the environmental score.

### Fitting the constrained RCM model

In order to request a constrained RCM fit it suffises to supply the names of the constraining variables to the _covariates_ argument. The shape of the response function can be either "linear", "quadratic" or "nonparametric" and must be provided to the _responseFun_ argument. Here we illustrate the use of linear and nonparametric response functions. Linear response functions may be too simplistic, they have the advantage of being easy to interpret (and plot). Non-parametric ones (based on splines) are more flexible but are harder to plot.

```{r constrLinAndNP, eval = FALSE}
#Linear
ZellerRCM2constr = RCM(Zeller, covariates = c("Age","Gender","BMI","Country", "Diagnosis"), responseFun = "linear")
#Nonparametric
ZellerRCM2constrNonParam = RCM(Zeller, covariates = c("Age","Gender","BMI","Country", "Diagnosis"), responseFun = "nonparametric")
```

#### Biplots

In the constrained case two different biplots are meaningful: sample-taxon biplots and variable-taxon biplots.

##### Sample-taxon biplot

```{r plotlin2cor}
plot(ZellerRCM2constr, plotType = c("species", "samples"))
```

The interpretation is similar as before: the orthogonal projection of a taxon's arrow on a sample represents the departure from independence for that taxon in that sample, _explained by environmental variables_. New is also that the taxa arrows do not start from the origin, but all have their own starting point. This starting point represents the environmental scores for which there is no departure from independence. The direction of the arrow then represents in which direction of the environmental gradient its expected abundance increases.

##### Variable-taxon biplot

```{r plotlin3}
plot(ZellerRCM2constr, plotType = c("species", "variables"))
```

The projection of species arrows on environmental variables (starting from the origin) represents the sensitivity of this taxon to changes in this variables. E.g. Pseudomonas fluorescens is much more abundant in healthy and small adenoma patients than in cancer patients. Note that the fact that the arrows of BMI and gender are of similar length indicates that one _standard deviation_ in BMI has a similar effect to gender.

We observe that country and diagnosis are the main drivers of the environmental gradient. We also see that healthy and small adenoma patients have similar taxon compositions, which are very different from the taxon composition of cancer patients.

#### Triplot: linear response functions

Triplots combine all the three components in a single plot. This is the default behaviour of the _plot.RCM_ function.

```{r plotlin3Triplot}
plot(ZellerRCM2constr, samColour = "Diagnosis")
```

Note that the samples and the environmental variables cannot be related to each other, but the sample-taxon and variable-taxon relationships discussed before are still valid on the triplot.

#### Triplot: Non-parametric response functions

For the non-parametric response functions we can only plot one dimensional triplots, whereby the shape of the response functions of the most strongly reacting taxa is shown. For this we use the _plotRespFun_ function.

```{r plotNPTriplot}
plotRespFun(ZellerRCM2constrNonParam)
```

The first dimension is shown by default, the environmental scores of the samples are shown below as ticks, the y-axis shows the response functions of the 8 most reacting species, whereby the x-axis represents the independence model. This graph demonstrates the power of non-parametric response function modelling to unravel differences in species niches.

This document covers only the more basic use of the RCM plotting functions. Discover more advanced features in the [RCM vignette](https://bioconductor.org/packages/release/bioc/vignettes/RCM/inst/doc/RCMvignette.html) or in the help pages of the functions used.