README.Rmd

---
output: github_document
---

<!-- README.md is generated from README.Rmd. Please edit that file -->

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```

# MRcML for Independent Samples and Overlapping Samples

<!-- badges: start -->
<!-- badges: end -->

R package for Mendelian randomization with constraind maximum likelihood (MRcML) methods. Here is the reference for the original MRcML method with two independent samples: [**Constrained maximum likelihood-based Mendelian randomization robust to both correlated and uncorrelated pleiotropic effects**](https://www.cell.com/ajhg/pdfExtended/S0002-9297(21)00219-6).
Here is the reference for the extension of MRcML method with overlapping samples:
[**Combining Mendelian randomization and network deconvolution for inference of causal networks with GWAS summary data**](https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1010762).

## Installation

<!-- You can install the released version of MRcML from [CRAN](https://CRAN.R-project.org) with:

``` r
install.packages("MRcML")
```
-->
 
Install the package from [GitHub](https://github.com/) with:

``` r
# install.packages("devtools")
devtools::install_github("xue-hr/MRcML")
```
## Example

Here is an example which shows how to apply MRcML methods to make inference about the causal effect from **Fast Glucose (FG)** to **Type-2 Diabetes (T2D)**.

```{r example}
library(MRcML)
summary(T2D_FG)
```

Example data `T2D_FG` is a list which contains estimated effects sizes and standard errors of 17 SNPs on T2D and FG. Now we perfrom the main function with sample size of FG which is 46186, and using 100 random start points. We set the random seed `random_seed = 1` to make sure results are replicable. First we use the function `mr_cML()`, which is for the two independent sample case. Then we apply the function `mr_cML_Overlap()`, which is for the overlapping sample case; for illustration purpose, here we assume the correlations between GWAS summary data being 0.1, i.e. setting `rho = 0.1`.

```{r}
### mr_cML() for two independent samples
cML_result = mr_cML(T2D_FG$b_exp,
                    T2D_FG$b_out,
                    T2D_FG$se_exp,
                    T2D_FG$se_out,
                    n = 46186,
                    random_start = 100,
                    random_seed = 1)

### mr_cML_Overlap() for two overlapping samples
cML_result_Overlap = mr_cML_Overlap(T2D_FG$b_exp,
                                    T2D_FG$b_out,
                                    T2D_FG$se_exp,
                                    T2D_FG$se_out,
                                    n = 46186,
                                    random_start = 100,
                                    random_seed = 1,
                                    rho = 0.1)

```

We get a warning message from the function `cML_estimate_random()`. The reason is: here we use 100 random starting points to minimize the non-convex loss function, some of them may not converge to a local minimum and result in Fisher Information matrices that are not positive definite. It is not likely affecting the optimization result, since in the end we only use the start point gives the minimum loss and discard all other start points including those do not converge. Now lets take a look at the results:

```{r}
cML_result
cML_result_Overlap
```

BIC selected model gives us indices of invalid IVs: 8, 12, 13, 15, 17. Now lets draw the scatter plot, invalid IVs are marked with blue:

```{r, echo = FALSE}
library(ggplot2)
plot_data = data.frame(b_exp = T2D_FG$b_exp,b_out = T2D_FG$b_out,
                           se_exp = T2D_FG$se_exp,se_out = T2D_FG$se_out)
    invalid_ind = as.factor(as.numeric(is.element(1:17,c(8, 12, 13, 15, 17))))
    plot_data = cbind(plot_data,invalid_ind = invalid_ind)
    
    fp2 = 
      ggplot(data = plot_data,aes(x = b_exp,y = b_out)) +
      geom_point(aes(col = invalid_ind)) + 
      geom_errorbar(aes(ymin = b_out-se_out,ymax = b_out+se_out,
                        col = invalid_ind),width=0) + 
      geom_errorbarh(aes(xmin = b_exp-se_exp, xmax = b_exp+se_exp,
                         col = invalid_ind),height=0) + 
      scale_color_manual(values = c("grey","blue")) + 
      theme(legend.position = "none") + 
      #geom_abline(intercept = 0,
      #            slope = c(TLP_result[i,c(1,BIC_min_ind)]),
      #            col = c("black","red"),
      #            linetype=c("dashed","solid"),
      #            size = c(1,1)
      #) + 
      geom_hline(yintercept = 0, linetype="solid", 
                 color = "black", size=0.5) +
      geom_vline(xintercept = 0, linetype="solid", 
                 color = "black", size=0.5) +
      labs(x = "Effect Size of FG",
           y = "Effect Size of T2D")
    
    fp2

```

Now let us perform cML with data perturbation. The default number of perturbations is 200, and we use 10 random start points. In real application, we recommend use more random start points to get reliable results even it takes more time, like 10 or even 100; in simulations the number of random start points could be set to 0 (i.e. do not use random start) to speed up.
First we use the function `mr_cML_DP()`, which is for the two independent sample case. Then we apply the function `mr_cML_DP_Overlap()`, which is for the overlapping sample case; again for illustration purpose, here we assume the correlations between GWAS summary data being 0.1, i.e. setting `rho = 0.1`.


```{r, warning=FALSE}
### mr_cML_DP() for two independent samples
cML_result_DP = mr_cML_DP(T2D_FG$b_exp,
                          T2D_FG$b_out,
                          T2D_FG$se_exp,
                          T2D_FG$se_out,
                          n = 46186,
                          random_start = 10,
                          random_start_pert = 10,
                          random_seed = 1,
                          num_pert = 200)

### mr_cML_DP_Overlap() for two overlapping samples
cML_result_DP_Overlap = mr_cML_DP_Overlap(b_exp = T2D_FG$b_exp,
                                          b_out = T2D_FG$b_out,
                                          se_exp = T2D_FG$se_exp,
                                          se_out = T2D_FG$se_out,
                                          n = 46186,
                                          random_start = 10,
                                          random_start_pert = 10,
                                          random_seed = 1,
                                          num_pert = 200,
                                          rho = 0.1)

```

Results with data perturbation:

```{r}
cML_result_DP
cML_result_DP_Overlap
```
<!-- You'll still need to render `README.Rmd` regularly, to keep `README.md` up-to-date.

You can also embed plots, for example:

```{r pressure, echo = FALSE}
plot(pressure)
```

In that case, don't forget to commit and push the resulting figure files, so they display on GitHub! -->