sparsecon
Summary: Implementing a new R package for scalable sparse concentration matrix estimation with efficient parallel processing.
Description: Under Gaussian modeling, a sparse concentration (inverse covariance) matrix yields an undirected graphical structure over the covariates, known as a Gaussian Markov random field. Such graphical structure is of interest for understanding high-throughput genomics or large social network data. Due to the high dimensionality of such data, efficient estimation through parallel computation is highly desirable. This new package will implement the neighborhood-selection algorithm by Meinshausen and Bühlmann (2006) and the Dantzig-selector-based algorithm by Yuan (2010). Both algorithms lend themselves well to parallelization and can be implemented sharing a large portion of code. Their subproblems can be solved using existing R packages such as glmnet and flare.
Several packages supporting parallelization will be considered, both for shared-memory parallelism (e.g. parallel, Rdsm) and for distributed-memory parallelism (e.g. Rmpi, snowfall, or BatchExperiments).
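To illustrate the neighborhood-selection idea, here is a minimal base-R sketch: each variable is regressed on all others with an L1-penalized least-squares fit, and an edge is drawn wherever a coefficient is nonzero. In the actual package the lasso subproblem would be delegated to glmnet; the hand-rolled coordinate descent below, the function names, and the fixed lambda value are illustrative assumptions only, chosen so the sketch runs without CRAN dependencies.

```r
library(parallel)

soft <- function(z, g) sign(z) * pmax(abs(z) - g, 0)  # soft-thresholding operator

# Minimal coordinate-descent lasso: min (1/2n)||y - Xb||^2 + lambda*||b||_1
lasso_cd <- function(X, y, lambda, n_iter = 200) {
  n <- nrow(X); p <- ncol(X)
  b <- rep(0, p)
  r <- y                      # residual y - X %*% b (b = 0 initially)
  xx <- colSums(X^2) / n      # per-coordinate curvature
  for (it in seq_len(n_iter)) {
    for (j in seq_len(p)) {
      zj <- sum(X[, j] * r) / n + xx[j] * b[j]
      bj <- soft(zj, lambda) / xx[j]
      r <- r - X[, j] * (bj - b[j])
      b[j] <- bj
    }
  }
  b
}

set.seed(1)
n <- 100; p <- 8
X <- matrix(rnorm(n * p), n, p)
X[, 2] <- X[, 1] + 0.3 * rnorm(n)    # induce one strong conditional dependency

# One neighborhood regression per variable; each runs independently in parallel
nbrs <- mclapply(seq_len(p), function(j) {
  b <- lasso_cd(X[, -j, drop = FALSE], X[, j], lambda = 0.3)
  idx <- seq_len(p)[-j]
  idx[abs(b) > 1e-8]
}, mc.cores = if (.Platform$OS.type == "unix") 2L else 1L)

# OR-rule symmetrization: connect i and j if either selects the other
A <- matrix(0L, p, p)
for (j in seq_len(p)) A[j, nbrs[[j]]] <- 1L
A <- (A | t(A)) * 1L
```

Because each of the p regressions touches only its own column selection, the loop over variables is embarrassingly parallel, which is exactly what makes this algorithm a good fit for mclapply or distributed backends.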
Related work: Several R packages for sparse concentration matrix estimation already exist. However, either their underlying algorithm is not suitable for parallelization, or an efficient parallel implementation in R has not yet been considered. For example, glasso and dpglasso are based on block coordinate descent, and flare is based on the ADMM (alternating direction method of multipliers); these algorithms, in their current form, are not well-suited for parallel implementation. On the other hand, QUIC and scalreg use algorithms potentially suitable for parallelization, but they are not implemented with full consideration of parallelism in R.
Potential tasks:
- Careful handling of shared data across threads in multi-core systems
- Parallel implementation of multiple regressions, using/implementing convex optimization algorithms (can build on existing R packages, e.g. glmnet or flare)
- Design and implementation of an efficient solver for a simple constrained quadratic program (required for symmetrizing the estimate)
- Consideration of a graph output format compatible with graph visualization software (e.g. Graphviz)
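For the last task, one plausible target is the Graphviz DOT format, which can be produced from an adjacency matrix with a few lines of base R. The function name and variable naming below are illustrative assumptions, not a committed API:

```r
# Convert a binary adjacency matrix of an undirected graph into Graphviz DOT text
adj_to_dot <- function(A, names = colnames(A)) {
  p <- nrow(A)
  if (is.null(names)) names <- paste0("v", seq_len(p))
  edges <- character(0)
  for (i in seq_len(p - 1))
    for (j in (i + 1):p)            # upper triangle only: undirected edges
      if (A[i, j] != 0)
        edges <- c(edges, sprintf("  %s -- %s;", names[i], names[j]))
  paste(c("graph G {", edges, "}"), collapse = "\n")
}

A <- matrix(0, 3, 3); A[1, 2] <- A[2, 1] <- 1
cat(adj_to_dot(A))
# graph G {
#   v1 -- v2;
# }
```

Emitting plain DOT keeps the package free of visualization dependencies while remaining compatible with Graphviz and other tools that read the format.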
Skills required:
- Decent programming experience in R and parallel programming.
- Basic understanding of convex optimization.
- Basic understanding of statistics.
Test:
- Solve an L1-penalized least-squares regression problem on random data of varying dimensions using the glmnet package.
- Solve p instances of the above problem on a multi-core system using mclapply. Generate a data matrix of dimension n by p, shared across all threads, and solve the i-th instance of the problem on a submatrix of dimension n by (p-1), ignoring the i-th column of the generated matrix. Avoid excess memory copies whenever possible.
- Create an R package with two functions performing the above two tasks, with proper documentation.
Mentor: Sangkyun Lee (sangkyun.lee {at} cs {dot} tu-dortmund {dot} de) and TBD as backup mentor
References:
- Meinshausen, N. and Bühlmann, P. (2006) High-dimensional graphs and variable selection with the lasso. Annals of Statistics, 34, pp. 1436-1462.
- Yuan, M. (2010) High dimensional inverse covariance matrix estimation via linear programming. Journal of Machine Learning Research, 11, pp. 2261-2286.