Coarse-grained parallel thin-QR decomposition algorithms for tall-and-skinny matrices on a CPU in C++ using OpenMP.
I wanted to learn more about parallelizing programs for high-performance numerical computations on CPUs. The authors' GitHub repository that these algorithms are originally based on contains heterogeneous versions that are relatively difficult to understand without first seeing the pseudocode. This project was therefore an opportunity to mesh my math and HPC interests: learn parallelization in practice, pick up OpenMP, and refresh my QR decomposition math while providing some (more) user-friendly code.
The concepts and insights of this project are not novel, but I wanted to implement numerical algorithms from the literature as a "warm-up" to a very interesting project that I will begin working on soon (and to show my C++ competence). This is a hint at said project's topic.
C++ implementation of novel parallel QR decomposition algorithms from this paper. I will implement the GPU-limited algorithms from its repository (EDIT AFTER IMPLEMENTATION: the GPU-limited algorithms were VERY slow as they were meant for GPUs).
Start by reading this paper for background. You may continue reading now.
Parallel algorithms like these significantly speed up least-squares regression and eigenvalue computations for PCA, among other relevant applications. Basically, data scientists will waste less time waiting for models to finish training and can iterate/improve solutions faster.
What this means for business people who don't care about any of that high-performance-numerical-computing-math stuff: the computer is faster and your engineers can make you more money.
exploration: I tinker with OpenMP to make sure it works on my PC (see the minimal check after this list).
utils: Contains the master helper file with all algorithm implementations (some algorithms are helper functions; this is convenient for testing).
tests: Speed and orthogonality error tests. The raw .csv for the speed tests (in seconds) is in the /data path.*
The remaining folders are named after their algorithm and contain a .cpp file with the respective implementation.
*I know it is bad practice to put the .csv in a GitHub repo, but its size is negligible and the raw data provides relevant insights.
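To confirm the toolchain actually enables OpenMP (the point of the exploration folder), something like the following minimal check works; the file name and compile flag are illustrative, not files from this repo.

```cpp
// check_omp.cpp -- compile with e.g.: g++ -fopenmp check_omp.cpp
#include <cstdio>
#include <omp.h>

int main() {
    #pragma omp parallel
    {
        // Serialize the prints so the output is readable.
        #pragma omp critical
        std::printf("hello from thread %d of %d\n",
                    omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}
```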
Skip if you have an undergraduate-level understanding of numerical computing and parallelism.
The most common occurrence of tall-and-skinny matrices is design matrices for machine learning, where the number of data points is much larger than the number of features. Other common applications are Fourier transforms in sensors, finite element methods, and the Jacobian matrices of iterative optimization algorithms.
Upper bound on the relative approximation error of floating point computations. Unit roundoff is a synonym. It is typically denoted by $\epsilon_{\text{machine}}$.
Given a floating-point format with base $\beta$ and precision $t$ (e.g., $\beta = 2$, $t = 53$ for IEEE double precision), $\epsilon_{\text{machine}} = \beta^{1-t}$.
Informally, this is the smallest difference between one floating point number and the next.
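For a concrete value, the C++ standard library exposes this constant directly; a small sketch for IEEE double precision:

```cpp
#include <cstdio>
#include <limits>

int main() {
    // Gap between 1.0 and the next representable double (2^-52 for IEEE 754).
    const double eps = std::numeric_limits<double>::epsilon();
    std::printf("machine epsilon = %.3e\n", eps);   // prints 2.220e-16
    return 0;
}
```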
Measures the sensitivity of the solution to perturbations in the input. It is typically denoted by $\kappa$.
Suppose $f$ is the problem being solved, $x$ is its input, and $\delta f = f(x + \delta x) - f(x)$ for a small perturbation $\delta x$.
- Absolute condition number: $\hat{\kappa} = \lim_{\delta \to 0} \sup_{\|\delta x\| \leq \delta} \frac{\|\delta f\|}{\|\delta x\|}$
- Relative condition number: $\kappa = \lim_{\delta \to 0} \sup_{\|\delta x\| \leq \delta} \frac{\|\delta f\| / \|f(x)\|}{\|\delta x\| / \|x\|}$
When using the 2-norm, the condition number of a matrix $A$ is $\kappa_2(A) = \|A\|_2 \|A^{-1}\|_2 = \sigma_{\max}(A) / \sigma_{\min}(A)$.
If a condition number is near $1$, the problem is well-conditioned; if it is much larger, the problem is ill-conditioned and small input perturbations can cause large changes in the output.
A matrix decomposition method, $A = QR$, known primarily for solving linear least squares via back-substitution ($Rx = Q^T b$).
A QR decomposition is "thin" when, for $A \in \mathbb{R}^{m \times n}$ with $m \geq n$, only the first $n$ columns of $Q$ and the top $n \times n$ block of $R$ are kept, giving $A = \hat{Q}\hat{R}$ with $\hat{Q} \in \mathbb{R}^{m \times n}$ and $\hat{R} \in \mathbb{R}^{n \times n}$.
QR decomposition is preferred to the normal equations for solving linear systems since the normal equations square the condition number and may lead to significant rounding errors when $A$ is ill-conditioned.
There are various other special properties regarding the matrices in QR decomposition and variants of the algorithm that improve the condition number + computation speed. I implore you to discover them for yourself; this documentation is a very basic crash course.
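As a small illustration of the least-squares use case (using Eigen, one of the comparison libraries mentioned later, purely for brevity; this is not the repository's code), a thin Householder QR solves a small problem via back-substitution:

```cpp
#include <Eigen/Dense>
#include <iostream>

int main() {
    // Tall-and-skinny least-squares problem: 6 samples, 2 features.
    const Eigen::MatrixXd A = Eigen::MatrixXd::Random(6, 2);
    const Eigen::VectorXd b = Eigen::VectorXd::Random(6);

    // Thin QR: A = QR with Q (6x2) orthonormal and R (2x2) upper triangular.
    Eigen::HouseholderQR<Eigen::MatrixXd> qr(A);
    const Eigen::MatrixXd Q = qr.householderQ() * Eigen::MatrixXd::Identity(6, 2);
    const Eigen::MatrixXd R = qr.matrixQR().topRows(2).triangularView<Eigen::Upper>();

    // Back-substitution on R x = Q^T b gives the least-squares solution.
    const Eigen::VectorXd x = R.triangularView<Eigen::Upper>().solve(Q.transpose() * b);
    std::cout << "x = " << x.transpose() << std::endl;
    return 0;
}
```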
The Gram-Schmidt algorithm finds an orthonormal basis for a set of linearly independent vectors.
Algorithm: Gram-Schmidt Orthogonalization
Input: Linearly independent vectors $v_0, \dots, v_n$
Output: Orthonormal basis $\{e_0, \dots, e_n\}$

1. Initialize:
   - Set $u_0 = v_0$
   - Normalize: $e_0 = \frac{u_0}{\|u_0\|}$
2. For $i = 1, \dots, n$:
   - Set $u_i = v_i$
   - For $j = 0, \dots, i - 1$:
     - Compute projection: $\text{proj}_{u_j}(v_i) = \frac{\langle v_i, u_j \rangle}{\langle u_j, u_j \rangle} u_j$
     - Subtract projection: $u_i = u_i - \text{proj}_{u_j}(v_i)$
   - Normalize: $e_i = \frac{u_i}{\|u_i\|}$
3. Return $\{e_0, \dots, e_n\}$
This orthogonalization method is relevant because $Q = \begin{bmatrix}e_0 \cdots e_n\end{bmatrix}$ (the orthonormal vectors) and $R$ is the upper-triangular matrix of projection coefficients $r_{ji} = \langle e_j, v_i \rangle$ for $j \leq i$.
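A minimal, unoptimized C++ sketch of the pseudocode above (classical Gram-Schmidt applied in place to a set of vectors; the function name and data layout are illustrative):

```cpp
#include <cmath>
#include <vector>

// In-place classical Gram-Schmidt following the pseudocode above;
// v[i] is overwritten with the orthonormal vector e_i.
// Purely illustrative: no pivoting, no rank checks, not performance-tuned.
void gram_schmidt(std::vector<std::vector<double>>& v) {
    const std::size_t n = v.size();
    for (std::size_t i = 0; i < n; ++i) {
        std::vector<double> u = v[i];                       // u_i = v_i
        for (std::size_t j = 0; j < i; ++j) {
            // Projection coefficient <v_i, e_j>; e_j already has unit norm,
            // so the denominator <u_j, u_j> from the pseudocode equals 1.
            double dot = 0.0;
            for (std::size_t r = 0; r < v[i].size(); ++r)
                dot += v[i][r] * v[j][r];
            for (std::size_t r = 0; r < u.size(); ++r)
                u[r] -= dot * v[j][r];                      // subtract projection
        }
        double norm = 0.0;                                  // normalize u_i
        for (double x : u) norm += x * x;
        norm = std::sqrt(norm);
        for (double& x : u) x /= norm;
        v[i] = u;                                           // store e_i
    }
}
```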
This will remain relatively high-level for brevity.
Fine-grained parallelism decomposes a large computation into trivial tasks at an individual-element (or small-block) level.
If I do a matrix multiplication, a task might be computing a single element of the output matrix.
Communication cost is high relative to the computation: threads must constantly synchronize and exchange data.
Read more about how NVIDIA is taking advantage of shared memory here.
The work-to-communication ratio is very small, implying each processor performs few computations.
Coarse-grained parallelism decomposes a large computation into medium- to large-sized tasks.
Following the matrix multiplication example, a task might be computing an entire block of rows of the output matrix.
Communication cost is low relative to the computation: threads only synchronize when combining their blocks.
Work-to-communication ratio is relatively large, implying each processor performs MANY computations.
Coarse-grained parallelization is better suited for problems limited by synchronization and communication latency such as in distributed databases where data is partitioned across nodes or in graph algorithms whose workload per edge/vertex varies greatly.
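A sketch of the coarse-grained pattern used throughout this project: each OpenMP thread receives one contiguous chunk of rows (a single large task) rather than synchronizing per element. The function and its workload are illustrative.

```cpp
#include <omp.h>
#include <vector>

// Coarse-grained chunking: one contiguous block of rows per thread.
// A (rows x cols) is stored row-major in a flat std::vector.
void scale_rows(std::vector<double>& A, int rows, int cols, double alpha) {
    #pragma omp parallel
    {
        const int threads = omp_get_num_threads();
        const int id      = omp_get_thread_num();
        const int chunk   = rows / threads;
        const int start   = id * chunk;
        const int end     = (id == threads - 1) ? rows : start + chunk;
        // Large block of independent work per thread; no communication inside.
        for (int i = start; i < end; ++i)
            for (int j = 0; j < cols; ++j)
                A[i * cols + j] *= alpha;
    }
}
```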
Strong scaling measures how execution time decreases as the number of processors increases while the total problem size is held fixed.
See Amdahl's law for more details. Hiring more people to paint a fence speeds it up, but adding too many does not since the work cannot be divided infinitely.
Weak scaling measures how execution time changes as the number of processors increases with a fixed problem size per processor. Gustafson's law gives the scaled speedup $S = s + (1 - s)N$, where $s$ is the serial fraction of the work and $N$ is the number of processors.
Informally, weak scaling and Gustafson's law explain that increasing the problem size and the number of processors results in near-linear speedups. Instead of painting a fence faster, paint a longer fence in the same amount of time by hiring more people.
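A toy calculation contrasting the two laws for a hypothetical code that is 5% serial (the numbers are illustrative, not measurements from this project):

```cpp
#include <cstdio>

int main() {
    // Toy comparison for a hypothetical code that is 5% serial.
    const double s = 0.05;                                    // serial fraction
    for (int p = 1; p <= 64; p *= 2) {
        const double amdahl    = 1.0 / (s + (1.0 - s) / p);   // strong-scaling speedup
        const double gustafson = s + (1.0 - s) * p;           // weak-scaling (scaled) speedup
        std::printf("p = %2d  Amdahl = %6.2f  Gustafson = %6.2f\n",
                    p, amdahl, gustafson);
    }
    return 0;
}
```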
See this link for the full pseudocode; it is not rewritten here for brevity.
Suppose $A \in \mathbb{R}^{m \times n}$ is tall-and-skinny ($m \gg n$) with full column rank.
Algorithm: CholeskyQR
Input: Matrix $A \in \mathbb{R}^{m \times n}$
Output: Matrices $Q \in \mathbb{R}^{m \times n}$ and $R \in \mathbb{R}^{n \times n}$ such that $A = QR$

1. Construct Gram matrix: $W = A^T A$
2. Cholesky factorization: $W = R^T R$
3. Compute $Q$: $Q = A R^{-1}$
4. Return $(Q, R)$
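A minimal sequential sketch of the algorithm above, written with Eigen for readability; the repository's implementations use their own routines, so the function name here is illustrative.

```cpp
#include <Eigen/Dense>
#include <utility>

// Sequential CholeskyQR following the pseudocode above, sketched with Eigen.
std::pair<Eigen::MatrixXd, Eigen::MatrixXd> cholesky_qr(const Eigen::MatrixXd& A) {
    const Eigen::MatrixXd W = A.transpose() * A;   // Gram matrix (n x n)
    const Eigen::LLT<Eigen::MatrixXd> llt(W);      // Cholesky: W = R^T R
    const Eigen::MatrixXd R = llt.matrixU();       // upper-triangular factor
    // Q = A R^{-1}; in practice, prefer a triangular solve over forming R^{-1}.
    const Eigen::MatrixXd Q = A * R.inverse();
    return {Q, R};
}
```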
Algorithm: ParallelCholeskyQR
Input: Matrix $A \in \mathbb{R}^{m \times n}$
Output: Matrices $Q$ and $R$ such that $A = QR$

1. Initialize variables:
   - $\text{rows} \gets \text{rows}(A)$, $\text{cols} \gets \text{cols}(A)$, $\text{threads} \gets \text{max-threads}()$
   - $W$ as a zero matrix of size $n \times n$
   - $\text{local W}$ as an array of zero matrices of size $n \times n$, one for each thread
2. Compute Gram matrix in parallel:
   - Parallel for each $\text{thread id} \in [0, \dots, \text{threads} - 1]$:
     - $\text{chunk size} \gets \big\lfloor\frac{\text{rows}}{\text{threads}}\big\rfloor$
     - $\text{start} \gets \text{thread id} \times \text{chunk size}$
     - $\text{end} \gets \begin{cases} \text{rows}, & \text{if thread id} = \text{threads} - 1 \\ \text{start} + \text{chunk size}, & \text{otherwise} \end{cases}$
     - $A_i \gets A[\text{start}:\text{end}]$
     - $\text{local W}[\text{thread id}] \gets A_i^T A_i$
     - Critical section: $W \gets W + \text{local W}[\text{thread id}]$
     - $Q[\text{start}:\text{end}] \gets A_i$
3. Perform Cholesky factorization: $W = R^T R$
4. Compute $Q$ in parallel:
   - Parallel for each $\text{thread id} \in [0, \dots, \text{threads} - 1]$:
     - $Q[\text{start}:\text{end}] \gets Q[\text{start}:\text{end}] \times R^{-1}$
5. Return $(Q, R)$
The Gram matrix is computed in parallel by slicing $A$ row-wise into one block per thread: each thread computes its local product $A_i^T A_i$, and the thread-local results are summed into $W$ inside a critical section.
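A minimal OpenMP sketch of that Gram-matrix step, assuming $A$ is stored row-major in a flat `std::vector<double>`; names and layout are illustrative, not the repository's exact code.

```cpp
#include <omp.h>
#include <vector>

// Parallel Gram matrix W = A^T A; A is rows x cols, row-major.
std::vector<double> parallel_gram(const std::vector<double>& A, int rows, int cols) {
    std::vector<double> W(static_cast<std::size_t>(cols) * cols, 0.0);
    #pragma omp parallel
    {
        // Each thread accumulates the Gram matrix of its own block of rows.
        std::vector<double> localW(static_cast<std::size_t>(cols) * cols, 0.0);
        #pragma omp for schedule(static)
        for (int i = 0; i < rows; ++i)
            for (int j = 0; j < cols; ++j) {
                const double aij = A[i * cols + j];
                for (int k = 0; k < cols; ++k)
                    localW[j * cols + k] += aij * A[i * cols + k];
            }
        // Merge thread-local results, mirroring the critical section above.
        #pragma omp critical
        for (std::size_t idx = 0; idx < W.size(); ++idx)
            W[idx] += localW[idx];
    }
    return W;
}
```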
Algorithm: CQR2
Input: Matrix $A \in \mathbb{R}^{m \times n}$
Output: Matrices $Q$ and $R$ such that $A = QR$

1. First QR decomposition: $[Q_1, R_1] \gets \text{ParallelCholeskyQR}(A)$
2. Second QR decomposition for accuracy: $[Q, R_2] \gets \text{ParallelCholeskyQR}(Q_1)$
3. Compute final $R$: $R \gets R_2 R_1$
4. Return $(Q, R)$
CQR can produce non-orthogonal vectors, becoming unstable as the condition number increases. Repeating the orthogonalization improves stability, as detailed here. Orthogonality error for a single CholeskyQR pass scales as $O(\kappa(A)^2 \epsilon_{\text{machine}})$; the second pass in CQR2 reduces it back toward $O(\epsilon_{\text{machine}})$ provided $A$ is not too ill-conditioned.
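For concreteness, a hedged sketch of how CQR2 composes the two passes and how orthogonality error can be measured. It assumes a `cholesky_qr` routine like the earlier sketch (the name is illustrative), and the Frobenius norm is just one reasonable choice of metric.

```cpp
#include <Eigen/Dense>
#include <utility>

// Declared in the earlier CholeskyQR sketch.
std::pair<Eigen::MatrixXd, Eigen::MatrixXd> cholesky_qr(const Eigen::MatrixXd& A);

// CQR2: run CholeskyQR twice and compose the triangular factors.
std::pair<Eigen::MatrixXd, Eigen::MatrixXd> cholesky_qr2(const Eigen::MatrixXd& A) {
    auto [Q1, R1] = cholesky_qr(A);    // first pass
    auto [Q,  R2] = cholesky_qr(Q1);   // re-orthogonalization pass
    return {Q, R2 * R1};               // A = Q (R2 R1)
}

// Orthogonality error ||Q^T Q - I||_F.
double orthogonality_error(const Eigen::MatrixXd& Q) {
    const Eigen::Index n = Q.cols();
    return (Q.transpose() * Q - Eigen::MatrixXd::Identity(n, n)).norm();
}
```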
- Performs QR decomposition with a stability shift applied to the diagonal of the Gram matrix.
- Parallel implementation of the Shifted Cholesky QR algorithm.
- Performs QR decomposition using a combination of shifted Cholesky QR and Cholesky QR 2.
- Combines Cholesky QR with Gram-Schmidt orthogonalization for block-wise processing.
- Performs two iterations of Cholesky QR with Gram-Schmidt orthogonalization.
- Distributed implementation of Cholesky QR with Gram-Schmidt orthogonalization.
- Modified version of Cholesky QR2 with Gram-Schmidt, incorporating reorthogonalization and parallel processing.
Much more important work came up, and I accomplished the minimal acceptable output for this project.
- Figure out how to run C++ programs on my PC without breaking anything
- Implement simple parallel applications (for loops, other basics)
- Implement iterative Cholesky (used for speed comparison)
- Implement parallel Cholesky (used for speed comparison)
- Implement CholeskyQR2
- Implement sCQR3
- Implement CholeskyQR2 with Gram-Schmidt (CQRGS, CQR2GS)
- Implement Distributed Cholesky QR with blocked GS (dCQRbGS)
- Implement Modified Cholesky QR w/ GS
- Implement mCQR2GS (test, THEN potentially revert indexing and parallelized panels if computation is slower)
- Accuracy test: CholeskyQR2, sCQR, sCQR3, CQRGS, CQR2GS, dCQRGS, mCQR2GS
- Fix CQRGS, dCQRGS, mCQR2GS
- Speed test: CQR2GS, dCQRbGS, mCQR2GS (run the tests)
- Speed refactor
  a. Goal is to make these significantly faster than CQR while preserving the orthogonal-stability gains
  b. Flame graph to find overhead
  c. Write out the algorithm in ONE function to find computation reductions
  d. Code speed optimization
     i. Own functions (see flame graph)
        1. Cholesky QR2 with Gram-Schmidt
           a. Use `const`, `constexpr`, and proper C++ objects for clarity and speed
           b. Mathematical manipulations/simplifications
        2. Modified Cholesky QR2 with Gram-Schmidt
           a. Use `const`, `constexpr`, and proper C++ objects for clarity and speed
           b. Mathematical manipulations/simplifications
        3. Parallel CQR
           a. Use `const`, `constexpr`, and proper C++ objects for clarity and speed
           b. Mathematical manipulations/simplifications
     ii. Comparison functions
        1. LAPACK
        2. Intel MKL
        3. Eigen
        4. Armadillo
  e. After editing in the helper file, insert updated functions back into the original file(s)
- GENERAL CODE CLEANUP
- Write description
I am not optimizing `distributed_cholesky_QR_w_gram_schmidt` because it was meant to run on a CPU/GPU mix, and I am only running on a CPU for this project.