Practical 2: Performing a multi-omic analysis of cancer data with R

performing uni-omic and multi-omic molecular classifications
interpreting the results and finding clinical implications

Intro

The slides, along with the relevant slides from lecture 2, are available here.

Prerequisites

The data for the practical is in the GitHub repository, in medical_genomics_course/Practical2/Data (e.g., retrieved from the command line with git clone https://github.com/IARCbioinfo/medical_genomics_course)
R (v>=4.0) and Python (v>3.0) installed
MOFA+ installed

Preferred way of running the practical: mybinder

Alternative 1: use rstudio cloud

On https://rstudio.cloud/, log in and create a session, then install MOFA2 and the mofapy2 Python package from the R console:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("MOFA2")
system("pip install mofapy2")

Note: To retrieve the data for the practical within RStudio Cloud, you can type the following in your RStudio session:

system("git clone https://github.com/IARCbioinfo/medical_genomics_course")

Alternative 2: Singularity

You can use the Singularity container to avoid installing R and MOFA2:

singularity pull docker://iarcbioinfo/medical_genomics_course:practical2

Then run the container and R within the container:

singularity exec medical_genomics_course_practical2.sif R

Note: if you want a particular folder to be visible within Singularity (in particular shared drives that are not on your local computer), use the Singularity option "-B path:/mnt", where path is the original path of the folder on your computer, and /mnt is the location where you will see it within the Singularity container.

Check MOFA2 install

To make sure that MOFA2 is correctly installed with all its dependencies, you can run the simple example from the help page of the run_mofa function:

library(MOFA2)
file <- system.file("extdata", "test_data.RData", package = "MOFA2")
load(file)
# Create the MOFA object
MOFAmodel <- create_mofa(dt)
# Prepare the MOFA object with default options
MOFAmodel <- prepare_mofa(MOFAmodel)
# Run the MOFA model
MOFAmodel <- run_mofa(MOFAmodel)

Note: if you run into an error, try installing the mofapy2 Python package with system("pip install mofapy2")

Steps

Data preprocessing

Question 1: Loading the data

Get the clinical data and the methylation data from https://github.com/IARCbioinfo/medical_genomics_course/tree/main/Practical2/Data

a) Load the tab-separated-values files containing the clinical data (medical_genomics_course/Practical2/Data/Data.Clin.txt), the RNA expression data matrix (medical_genomics_course/Practical2/Data/RNA.tsv), and the 3 DNA methylation matrices (medical_genomics_course/Practical2/Data/DNAMeth_promoter.tsv, DNAMeth_genebody.tsv, and DNAMeth_enhancer.tsv) using your favorite file reading function (e.g., the read.table function from base R with sep = "\t" as argument, or the read_tsv function from package readr)

b) Check the objects you just created by using head and dim functions. What does each dimension represent? Does it matter if the dimensions of the matrices are different?
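As a rough illustration of Question 1, the sketch below reads a tab-separated matrix with read.table and inspects it with dim and head. The file is a toy stand-in written to a temporary location, and its gene and sample names are made up. In the practical you would point read.table at the files listed above, whose exact layout (header line, row-name column) is an assumption here.

```r
# Toy stand-in for one of the practical's matrices (hypothetical values and names)
toy <- data.frame(S1 = c(1.2, 0.4), S2 = c(0.8, 1.9),
                  row.names = c("geneA", "geneB"))
tmp <- tempfile(fileext = ".tsv")
# col.names = NA writes an empty header cell above the row-name column
write.table(toy, tmp, sep = "\t", quote = FALSE, col.names = NA)

# Read the file back: first column as row names, first line as column names
expr <- read.table(tmp, sep = "\t", header = TRUE, row.names = 1,
                   check.names = FALSE)
expr <- as.matrix(expr)

dim(expr)   # number of features (rows) and samples (columns)
head(expr)  # first rows of the matrix
```

Keeping features in rows and samples in columns matches the orientation that create_mofa expects when it is given a list of matrices.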

Question 2: Create a MOFA object

a) Visually check that each data set distribution approximately follows a Gaussian or a mixture of Gaussian distributions. Why is this assumption important?

b) Using the create_mofa function, create a MOFA object containing the resulting data sets.
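A minimal sketch of Question 2, assuming simulated Gaussian matrices in place of the real data: it draws one histogram per data set and, if MOFA2 is installed, builds a MOFA object from the list of matrices. The view names (RNA, Meth) and all dimensions are illustrative only.

```r
set.seed(1)
samples <- paste0("S", 1:20)

# Simulated views: features in rows, samples in columns (hypothetical sizes)
views <- list(
  RNA  = matrix(rnorm(50 * 20), nrow = 50,
                dimnames = list(paste0("gene", 1:50), samples)),
  Meth = matrix(rnorm(30 * 20), nrow = 30,
                dimnames = list(paste0("cpg", 1:30), samples))
)

# One histogram per view to eyeball the shape of each distribution
op <- par(mfrow = c(1, length(views)))
for (v in names(views)) {
  hist(views[[v]], breaks = 30, main = v, xlab = "value")
}
par(op)

# Build the MOFA object only if MOFA2 is available
if (requireNamespace("MOFA2", quietly = TRUE)) {
  mofa <- MOFA2::create_mofa(views)
  print(mofa)
}
```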

Training the model

Question 3: MOFA run settings and training the MOFA model

a) Using the get_default_data_options, get_default_model_options, and get_default_training_options functions, set the data, model, and training options to their default values, except for the number of factors (choose 5 factors) and the convergence mode (choose “slow” mode).

b) Using the prepare_mofa function, prepare the created MOFA object for proper training.

c) Run MOFA using the run_mofa function and save the resulting MOFA model locally. What is printed during the model training? When does the training stop?
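The three steps of Question 3 could look roughly like the sketch below. It uses the small test data set shipped with MOFA2 instead of the practical's data, writes the model to a temporary file (the file name is an arbitrary choice), and assumes that both MOFA2 and the mofapy2 Python package are installed.

```r
if (requireNamespace("MOFA2", quietly = TRUE)) {
  library(MOFA2)

  # Small test data set shipped with MOFA2; load() creates the object `dt`
  load(system.file("extdata", "test_data.RData", package = "MOFA2"))
  mofa <- create_mofa(dt)

  # Start from the defaults, then override only what the question asks for
  data_opts  <- get_default_data_options(mofa)
  model_opts <- get_default_model_options(mofa)
  model_opts$num_factors <- 5
  train_opts <- get_default_training_options(mofa)
  train_opts$convergence_mode <- "slow"

  mofa <- prepare_mofa(mofa,
                       data_options     = data_opts,
                       model_options    = model_opts,
                       training_options = train_opts)

  # Train and save the model (hypothetical output location)
  outfile <- file.path(tempdir(), "mofa_model.hdf5")
  mofa <- run_mofa(mofa, outfile = outfile)
}
```

A saved model can later be reloaded with load_model(outfile), which avoids retraining for the downstream questions.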

Quality control

Question 4: MOFA run quality controls

a) Using the plot_variance_explained function, assess the contribution of each data set to the definition of the MOFA axes. Write a short description of the resulting plot.

b) Check for outliers on all the MOFA axes. In your opinion, which axes should be removed from the downstream analyses? What can be the cause of such outliers?

c) Using the plot_factor_cor function, assess the uniqueness of each MOFA axis. What would you do if some factors were highly correlated?
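One possible way to approach these quality controls is sketched below. The helper flag_outliers is hypothetical (it is not part of MOFA2) and its threshold of 4 standard deviations is an arbitrary choice. It is first applied to a simulated factor matrix, then, if MOFA2 is installed, to the example model shipped with the package, together with the two plotting functions named above.

```r
# Hypothetical helper: flag samples lying more than n_sd standard deviations
# away from the mean of each factor (Z is a samples x factors matrix)
flag_outliers <- function(Z, n_sd = 4) {
  abs(scale(Z)) > n_sd
}

# Simulated factor values with one planted outlier
set.seed(2)
Z <- matrix(rnorm(100 * 5), ncol = 5,
            dimnames = list(paste0("S", 1:100), paste0("Factor", 1:5)))
Z["S7", "Factor4"] <- 15
colSums(flag_outliers(Z))  # number of flagged samples per factor

if (requireNamespace("MOFA2", quietly = TRUE)) {
  library(MOFA2)
  # Example trained model shipped with MOFA2, standing in for the practical's model
  model <- load_model(system.file("extdata", "model.hdf5", package = "MOFA2"))

  print(plot_variance_explained(model, x = "view", y = "factor"))
  plot_factor_cor(model)

  # get_factors returns one samples x factors matrix per group
  Zm <- do.call(rbind, get_factors(model))
  print(colSums(flag_outliers(Zm)))
}
```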

Bonus question: Comparison with uni-omic unsupervised analyses

a) Using the ade4 package, perform PCA using, first, RNA-seq data (PCA-exp) and second, DNA methylation data (PCA-meth).

b) Compare the resulting PCA-exp and PCA-meth with MOFA. Complete your quality control on MOFA.
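The bonus comparison might be sketched as follows, on simulated data. The code uses dudi.pca from ade4 when the package is installed and otherwise falls back to prcomp from base R, which is a swapped-in substitute rather than the method named in the question. The matrix Z is a simulated stand-in for the MOFA factor values.

```r
set.seed(3)
samples <- paste0("S", 1:30)
expr <- matrix(rnorm(100 * 30), nrow = 100,
               dimnames = list(paste0("gene", 1:100), samples))

# PCA functions expect samples in rows, so transpose the feature x sample matrix
X <- as.data.frame(t(expr))

if (requireNamespace("ade4", quietly = TRUE)) {
  pca_exp <- ade4::dudi.pca(X, center = TRUE, scale = TRUE,
                            scannf = FALSE, nf = 5)
  coords <- as.matrix(pca_exp$li)                   # sample coordinates
  var_explained <- pca_exp$eig / sum(pca_exp$eig)
} else {
  # Fallback when ade4 is missing: base R PCA (not the ade4 implementation)
  pca_exp <- prcomp(X, center = TRUE, scale. = TRUE)
  coords <- pca_exp$x[, 1:5]
  var_explained <- pca_exp$sdev^2 / sum(pca_exp$sdev^2)
}
round(var_explained[1:5], 3)

# Simulated stand-in for MOFA factor values (samples x factors)
Z <- matrix(rnorm(30 * 5), ncol = 5,
            dimnames = list(samples, paste0("Factor", 1:5)))

# Correlation between each principal component and each latent factor
round(cor(coords, Z[rownames(coords), ]), 2)
```

With a real model, Z would be replaced by the factor matrix returned by get_factors, and strong correlations between a principal component and a latent factor would indicate that both methods capture the same source of variation.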

Analysis and interpretation

Question 5: Downstream analyses - part 1

a) Which parameter drives the MOFA latent factors (LFs) order?

b) Plot the LF1-LF2 space and describe the distribution of the samples. Are histopathological types separated on LF1-LF2? What about molecular clusters?

c) Using the run_umap function, compare your first observation of the LF1-LF2 space with a UMAP computed from the first 5 LFs defined by MOFA. Are histopathological types better separated? What about molecular clusters?

d) Using the plot_weights function, identify the 10 genes whose expression data contribute the most to LF1 and LF2, respectively. What are the top 10 enhancer CpGs contributing the most to LF1? Give the related genes using the Manifest file.

e) Using the GOrilla web page and the RNA-seq data set, perform a GSEA to find the pathways contributing the most to the definition of LF1 and LF2, respectively. What are the common pathways?
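As an illustration of how the top contributing features of a factor can be extracted, the sketch below defines a hypothetical helper, top_features (not part of MOFA2), and applies it to a simulated weight matrix. If MOFA2 is installed, it also plots the first two factors and the weights of the example model shipped with the package, using that model's first view as a placeholder for the RNA view.

```r
# Hypothetical helper: names of the n features with the largest absolute
# weight on one factor (W is a features x factors matrix)
top_features <- function(W, factor, n = 10) {
  w <- W[, factor]
  names(sort(abs(w), decreasing = TRUE))[seq_len(min(n, length(w)))]
}

# Simulated weights with one planted strong contributor
set.seed(4)
W <- matrix(rnorm(200 * 2), ncol = 2,
            dimnames = list(paste0("gene", 1:200), c("Factor1", "Factor2")))
W["gene42", "Factor1"] <- 10
top_features(W, "Factor1")

if (requireNamespace("MOFA2", quietly = TRUE)) {
  library(MOFA2)
  model <- load_model(system.file("extdata", "model.hdf5", package = "MOFA2"))
  view <- views_names(model)[1]   # placeholder for the RNA view

  print(plot_factors(model, factors = c(1, 2)))
  print(plot_weights(model, view = view, factors = 1, nfeatures = 10))

  # get_weights returns one features x factors matrix per view
  W1 <- get_weights(model, views = view, factors = "all")[[1]]
  print(top_features(W1, 1))
}
```

For the UMAP comparison, the analogous calls on the practical's trained model would be run_umap(model, factors = 1:5) followed by plot_dimred(model, method = "UMAP").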

Question 6: Downstream analyses - part 2

a) In a table, provide the results of the statistical association tests between the clinical annotations given in the Clinical table and the 5 LFs defined by MOFA. Don’t forget the multiple-testing correction. Which clinical associations can you highlight?

b) Using k-means clustering, identify the clusters of samples in the LF1-LF2 space. What kind of supervised analyses can be used to characterise these groups of samples?
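A minimal sketch of Question 6 on simulated data is given below. The choice of tests (Kruskal-Wallis for categorical annotations, Spearman correlation for numeric ones), the Benjamini-Hochberg correction, and the number of clusters (2) are assumptions made for illustration. The clinical variables type and age are made up.

```r
set.seed(5)
n <- 60
group <- rep(c("A", "B"), each = n / 2)

# Simulated latent factors: Factor1 separates the two groups, Factor2 is noise
Z <- cbind(Factor1 = rnorm(n, mean = ifelse(group == "A", -2, 2)),
           Factor2 = rnorm(n))
rownames(Z) <- paste0("S", seq_len(n))

# Simulated clinical table (hypothetical variables)
clin <- data.frame(type = factor(group), age = rnorm(n, 60, 10),
                   row.names = rownames(Z))

# One association test per (factor, clinical variable) pair
tests <- expand.grid(factor = colnames(Z), variable = colnames(clin),
                     stringsAsFactors = FALSE)
tests$p <- mapply(function(f, v) {
  x <- clin[[v]]
  if (is.numeric(x)) {
    cor.test(Z[, f], x, method = "spearman", exact = FALSE)$p.value
  } else {
    kruskal.test(x = Z[, f], g = x)$p.value
  }
}, tests$factor, tests$variable, USE.NAMES = FALSE)

# Multiple-testing correction across all tests
tests$q <- p.adjust(tests$p, method = "BH")
tests[order(tests$q), ]

# k-means clustering in the LF1-LF2 space
km <- kmeans(Z[, c("Factor1", "Factor2")], centers = 2, nstart = 25)
table(cluster = km$cluster, type = clin$type)
```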

Bonus question: Downstream analyses - bonus

a) Using the plot_weights function, identify the COSMIC genes whose expression data are highly involved in the definition of LF1 and LF2.

b) Using the DESeq2 package, perform a differential expression analysis between two of the clusters you identified in Question 6 b). What are the COSMIC genes in common between the list of differentially expressed genes and the one identified in point a)?
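The differential expression step might be sketched as follows, assuming DESeq2 is installed. The counts are simulated with makeExampleDESeqDataSet, whose two-level condition factor stands in for the membership of two k-means clusters, and cosmic_genes is a made-up placeholder list, not a real COSMIC extract.

```r
# Hypothetical placeholder for a list of COSMIC genes
cosmic_genes <- c("gene1", "gene5", "gene10")

if (requireNamespace("DESeq2", quietly = TRUE)) {
  library(DESeq2)

  # Simulated count data; `condition` (A vs B) plays the role of the cluster label
  set.seed(6)
  dds <- makeExampleDESeqDataSet(n = 500, m = 12, betaSD = 1)

  # With real data, the object would instead be built from a raw count matrix
  # and a sample table, e.g.:
  # dds <- DESeqDataSetFromMatrix(countData = counts, colData = coldata,
  #                               design = ~ cluster)

  dds <- DESeq(dds)
  res <- results(dds, alpha = 0.05)

  # Genes called differentially expressed at an adjusted p-value below 0.05
  de_genes <- rownames(res)[which(res$padj < 0.05)]
  print(intersect(de_genes, cosmic_genes))
}
```

DESeq2 expects raw, unnormalised counts, so this step would rely on a count matrix rather than on the transformed expression values used for MOFA.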

Resources

MOFA article and MOFA+ article
MOFA+ R vignette
MOFA slack channel, with very responsive developers and a very active community

alcalan@iarc.who.int (Nicolas Alcala)
