This repository contains an implementation of the paper
- Li, C., Shen, X., & Pan, W. (2023). Inference for a large directed acyclic graph with unspecified interventions. Journal of Machine Learning Research. (A previous version was given the Student Paper Award, Conference on Statistical Learning and Data Science, ASA, May 2020)
First, install the package glmtlp which implements the DC projection algorithm in the paper.
devtools::install_github("chunlinli/glmtlp")
Then install the package intdag which offers the peeling causal discovery method and data-perturbation/asymptotic inference.
devtools::install_github("chunlinli/intdag/intdag")
Before proceeding, source the following R scripts for utility functions required in the following illustration.
Assume the working directory is the cloned repository https://github.com/chunlinli/intdag.
source("utility.R")
set.seed(1110)
First, we generate a random graph with p=10
and q=20
.
p <- 10
q <- 20
graph <- graph_generation(p = p, q = q, graph_type = "random", iv_sufficient = FALSE)
Note that we use the option iv_sufficient = FALSE
, which means a crucial assumption in the paper -- Assumption 1C is violated.
Let's print the
graph$u
This is a p
by p
adjacency matrix of the DAG that we want to recover and/or make inference. Note that
Then print the
graph$w
This is a q
by p
matrix, indicating the interventional relations of an intervention varibale w
corresponds to the simulation Setup B in the paper.
Next, generate a random sample of size n=200
based on the graph. According to the analysis in the paper, the distribution of intervention variables
n <- 200
x <- matrix(rnorm(n * q), nrow = n, ncol = q)
rho <- 0.5
if (rho != 0) {
for (j in 2:q) {
x[, j] <- sqrt(1 - rho^2) * x[, j] + rho * x[, j - 1]
}
}
Note that x
is an n
by q
matrix.
Then we generate the
y <- (x %*% graph$w + rmvnorm(n, sigma = diag(seq(from = 0.5, to = 1, length.out = p), p, p))) %*% solve(diag(p) - graph$u)
Here y
is an n
by p
matrix.
Now fit the model using intdag
. Note that intdag
calls glmtlp
for solving multi-response regression in its first step.
v_out <- v_estimation(y = y, x = x, model_selection = "bic")
Second, use peeling algorithm to recover the topological layers.
top_out <- topological_order(v_out$v)
an_mat <- top_out$an_mat
in_mat <- top_out$in_mat
Third, refit to estimate
discovery_out <- causal_discovery(y = y, x = x, an_mat = an_mat, in_mat = in_mat)
The final estimate for
discovery_out$u
We compare the final estimate with the true graph in the structural Hamming distance.
sum(abs((abs(discovery_out$u) > 0.05) - (graph$u != 0)))
Here we use a truncation threshold 0.05
to screen the small (noisy) coefficients.
The R package is placed in directory ./intdag/
.
The extensive simulation code is placed in directory ./simulation/
.
If you find the code useful, please consider citing
@article{li2023inference,
author = {Chunlin Li, Xiaotong Shen, Wei Pan},
title = {Inference for a large directed acyclic graph with unspecified interventions},
journal = {Journal of Machine Learning Research},
year = {2023}
}
Implementing these algorithms is error-prone and this code project is still in development. Please file an issue if you encounter any error when using the code. I will be grateful to be informed.