sparsecon
Summary: Implementing a new R package for scalable sparse concentration matrix estimation with efficient parallel processing.
Description: Under Gaussian modeling, a sparse concentration (inverse covariance) matrix yields an undirected graphical structure over the covariates, known as a Gaussian Markov random field. Such graphical structure is of interest for understanding high-throughput genomics or large social network data. Due to the high dimensionality of such data, efficient estimation through parallel computation is highly desirable. This new package will implement the neighborhood-selection algorithm by Meinshausen and Bühlmann (2006) and the Dantzig-selector-based algorithm by Yuan (2010). Both algorithms lend themselves well to parallelization and can be implemented sharing a large portion of code. Their subproblems can be solved using existing R packages such as glmnet and flare.
Several packages supporting parallelization will be considered, both for shared-memory parallelism (e.g. parallel, Rdsm) and for distributed-memory parallelism (e.g. Rmpi, snowfall, or BatchExperiments).
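To illustrate the neighborhood-selection idea, here is a minimal base-R sketch: each variable is regressed on all others with an L1-penalized least-squares fit, and an edge is drawn wherever a coefficient is nonzero. In the actual package the lasso subproblem would be delegated to glmnet; the hand-rolled coordinate descent below, the function names, and the fixed lambda value are illustrative assumptions only, chosen so the sketch runs without CRAN dependencies.

```r
library(parallel)

soft <- function(z, g) sign(z) * pmax(abs(z) - g, 0)  # soft-thresholding operator

# Minimal coordinate-descent lasso: min (1/2n)||y - Xb||^2 + lambda*||b||_1
lasso_cd <- function(X, y, lambda, n_iter = 200) {
  n <- nrow(X); p <- ncol(X)
  b <- rep(0, p)
  r <- y                      # residual y - X %*% b (b = 0 initially)
  xx <- colSums(X^2) / n      # per-coordinate curvature
  for (it in seq_len(n_iter)) {
    for (j in seq_len(p)) {
      zj <- sum(X[, j] * r) / n + xx[j] * b[j]
      bj <- soft(zj, lambda) / xx[j]
      r <- r - X[, j] * (bj - b[j])
      b[j] <- bj
    }
  }
  b
}

set.seed(1)
n <- 100; p <- 8
X <- matrix(rnorm(n * p), n, p)
X[, 2] <- X[, 1] + 0.3 * rnorm(n)    # induce one strong conditional dependency

# One neighborhood regression per variable; each runs independently in parallel
nbrs <- mclapply(seq_len(p), function(j) {
  b <- lasso_cd(X[, -j, drop = FALSE], X[, j], lambda = 0.3)
  idx <- seq_len(p)[-j]
  idx[abs(b) > 1e-8]
}, mc.cores = if (.Platform$OS.type == "unix") 2L else 1L)

# OR-rule symmetrization: connect i and j if either selects the other
A <- matrix(0L, p, p)
for (j in seq_len(p)) A[j, nbrs[[j]]] <- 1L
A <- (A | t(A)) * 1L
```

Because each of the p regressions touches only its own column selection, the loop over variables is embarrassingly parallel, which is exactly what makes this algorithm a good fit for mclapply or distributed backends.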
Related work: Several R packages for sparse concentration matrix estimation already exist. However, either their underlying algorithm is not suitable for parallelization, or an efficient parallel implementation in R has not yet been considered. For example, glasso and dpglasso are based on block coordinate descent, and flare is based on the ADMM (alternating direction method of multipliers); these algorithms, in their current form, are not well-suited for parallel implementation. On the other hand, QUIC and scalreg use algorithms potentially suitable for parallelization, but they are not implemented with full consideration of parallelism in R.
Potential tasks:
- Careful handling of shared data across threads in multi-core systems
- Parallel implementation of multiple regressions, using/implementing convex optimization algorithms (can build on existing R packages, e.g. glmnet or flare)
- Design and implementation of an efficient solver for a simple constrained quadratic program (required for symmetrizing the estimate)
- Consideration of a graph output format compatible with graph visualization software (e.g. Graphviz)
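For the last task, one plausible target is the Graphviz DOT format, which can be produced from an adjacency matrix with a few lines of base R. The function name and variable naming below are illustrative assumptions, not a committed API:

```r
# Convert a binary adjacency matrix of an undirected graph into Graphviz DOT text
adj_to_dot <- function(A, names = colnames(A)) {
  p <- nrow(A)
  if (is.null(names)) names <- paste0("v", seq_len(p))
  edges <- character(0)
  for (i in seq_len(p - 1))
    for (j in (i + 1):p)            # upper triangle only: undirected edges
      if (A[i, j] != 0)
        edges <- c(edges, sprintf("  %s -- %s;", names[i], names[j]))
  paste(c("graph G {", edges, "}"), collapse = "\n")
}

A <- matrix(0, 3, 3); A[1, 2] <- A[2, 1] <- 1
cat(adj_to_dot(A))
# graph G {
#   v1 -- v2;
# }
```

Emitting plain DOT keeps the package free of visualization dependencies while remaining compatible with Graphviz and other tools that read the format.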
Skills required:
- Decent programming experience in R and parallel programming.
- Basic understanding of convex optimization.
- Basic understanding of statistics.
Test:
- Solve an L1-penalized least-squares regression problem on random data of varying dimensions using the glmnet package.
- Solve p instances of the above problem on a multi-core system using mclapply. Generate a data matrix of dimension n by p, shared across all threads, and solve the i-th instance of the problem on a submatrix of dimension n by (p-1), ignoring the i-th column of the generated matrix. Avoid excess memory copies whenever possible.
- Create an R package with two functions performing the above two tasks, with proper documentation.
Mentor: Sangkyun Lee (sangkyun.lee {at} cs {dot} tu-dortmund {dot} de) and TBD as backup mentor
References:
- Meinshausen, N. and Bühlmann, P. (2006) High-dimensional graphs and variable selection with the lasso. Annals of Statistics, 34, pp. 1436-1462.
- Yuan, M. (2010) High dimensional inverse covariance matrix estimation via linear programming. Journal of Machine Learning Research, 11, pp. 2261-2286.