Coarse-grained parallel thin-QR decomposition algorithms for tall-and-skinny matrices on a CPU, written in C++ using OpenMP.
I wanted to learn more about parallelizing programs for high-performance numerical computations on CPUs. The authors' GitHub repository that these algorithms are originally based on contains heterogeneous versions that are relatively difficult to understand without first seeing the pseudocode. This project therefore let me mesh my math and HPC interests: learning parallelization in practice, learning OpenMP, and refreshing my QR decomposition math, all while providing some (more) user-friendly code.
The concepts and insights of this project are not novel, but I wanted to implement numerical algorithms from the literature as a "warm-up" to a very interesting project I will begin working on soon (and to show my C++ competence). This is a hint at said project's topic.
C++ implementation of novel parallel QR decomposition algorithms from this paper. I will implement the GPU-limited algorithms from its repository (EDIT AFTER IMPLEMENTATION: the GPU-limited algorithms were VERY slow on a CPU, as they were meant for GPUs).
Start by reading this paper for background. You may continue reading now.
Parallel algorithms like these significantly speed up least-squares regression and eigenvalue computations for PCA, among other relevant applications. Basically, data scientists will waste less time waiting for models to finish training and can iterate/improve solutions faster.
What this means for business people who don't care about any of that high-performance-numerical-computing-math stuff: the computer is faster and your engineers can make you more money.
exploration: I tinker with OpenMP to make sure it works on my PC.
utils: Contains the master helper file with all algorithm implementations (some algorithms are helper functions; this is convenient for testing).
tests: Speed and orthogonality error tests. The raw .csv for the speed tests (in seconds) is in the /data path.*
The remaining folders are named after their algorithm and contain a .cpp file with the respective implementation.
*I know it is bad practice to put the .csv in a GitHub repo, but its size is negligible and the raw data provides relevant insights.
Skip if you have an undergraduate-level understanding of numerical computing and parallelism.
The most common occurrence of tall-and-skinny matrices is design matrices for machine learning, where the number of data points is much larger than the number of features. Other common applications are Fourier transforms in sensors, finite element methods, and the Jacobian matrices of iterative optimization algorithms.
Upper bound on the relative approximation error of floating-point computations; unit roundoff is a synonym. It is denoted by $u$ (or $\epsilon_{\text{mach}}$). Given a real number $x$ and its floating-point representation $\mathrm{fl}(x)$, $\frac{|\mathrm{fl}(x) - x|}{|x|} \le u$; for IEEE double precision, $u = 2^{-53} \approx 1.1 \times 10^{-16}$.
Informally, this is the smallest difference between one floating-point number and the next.
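A tiny illustrative snippet (not part of the repo) that prints these constants on your own machine:

```cpp
#include <iostream>
#include <limits>

int main() {
    // std::numeric_limits gives the gap between 1.0 and the next representable double,
    // i.e. machine epsilon; half of it is the unit roundoff used in error bounds.
    const double eps = std::numeric_limits<double>::epsilon();
    std::cout << "machine epsilon: " << eps << "\n";        // ~2.22e-16
    std::cout << "unit roundoff:   " << eps / 2.0 << "\n";  // ~1.11e-16
    // Adding anything smaller than the unit roundoff to 1.0 is lost to rounding.
    std::cout << "1.0 + eps/4 == 1.0 ? " << ((1.0 + eps / 4.0) == 1.0) << "\n";
    return 0;
}
```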
Measures the sensitivity of the solution to perturbations in the input. It is typically denoted by $\kappa$. Suppose $f$ is the problem being solved, $x$ is the input, and $\delta f = f(x + \delta x) - f(x)$ for a small perturbation $\delta x$.
Absolute condition number: $\hat{\kappa} = \lim_{\varepsilon \to 0} \sup_{\|\delta x\| \le \varepsilon} \dfrac{\|\delta f\|}{\|\delta x\|}$
Relative condition number: $\kappa = \lim_{\varepsilon \to 0} \sup_{\|\delta x\| \le \varepsilon} \dfrac{\|\delta f\| / \|f(x)\|}{\|\delta x\| / \|x\|}$
When using the 2-norm, the condition number of a matrix $A$ is $\kappa_2(A) = \|A\|_2 \|A^{-1}\|_2 = \sigma_{\max}(A) / \sigma_{\min}(A)$, the ratio of its largest and smallest singular values.
If a condition number is near $1$, the problem is well-conditioned; if it is much larger, small perturbations in the input can cause large changes in the solution and the problem is ill-conditioned.
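As a small worked example, consider

$$A = \begin{bmatrix} 1 & 0 \\ 0 & 10^{-8} \end{bmatrix}, \qquad \kappa_2(A) = \frac{\sigma_{\max}}{\sigma_{\min}} = \frac{1}{10^{-8}} = 10^{8}.$$

Solving $Ax = b$ in double precision can then lose roughly 8 of the ~16 significant digits, and forming $A^\top A$ would square the condition number to $10^{16}$, losing essentially all of them.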
A matrix decomposition method, $A = QR$ with $Q$ having orthonormal columns and $R$ upper triangular, known primarily for solving linear least squares via back-substitution ($Rx = Q^\top b$).
A QR decomposition is "thin" when $A \in \mathbb{R}^{m \times n}$ with $m \gg n$ and only the first $n$ columns of $Q$ and the top $n \times n$ block of $R$ are kept, so $Q \in \mathbb{R}^{m \times n}$ and $R \in \mathbb{R}^{n \times n}$.
QR decomposition is preferred to the normal equations for solving linear systems since the normal equations square the condition number ($\kappa(A^\top A) = \kappa(A)^2$) and may lead to significant rounding errors when $A$ is ill-conditioned.
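For reference, a short sketch of why the thin QR factorization solves least squares without the squared condition number (standard material, not specific to the paper):

$$A^\top A x = A^\top b \;\Longrightarrow\; R^\top Q^\top Q R\, x = R^\top Q^\top b \;\Longrightarrow\; R x = Q^\top b,$$

which is solved by back-substitution since $R$ is upper triangular, and $\kappa(R) = \kappa(A)$ because $Q$ has orthonormal columns.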
There are various other special properties of the matrices in QR decomposition, and variants of the algorithm that improve the condition number and computation speed. I encourage you to discover them for yourself; this documentation is a very basic crash course.
The Gram-Schmidt algorithm finds an orthonormal basis for a set of linearly independent vectors.
Algorithm: Gram-Schmidt Orthogonalization
Input: Linearly independent vectors $a_0, a_1, \ldots, a_n$
Output: Orthonormal basis $e_0, e_1, \ldots, e_n$

- Initialize:
  - Set $u_0 = a_0$
  - Normalize: $e_0 = \frac{u_0}{\|u_0\|}$
- For $k = 1, \ldots, n$:
  - Set $u_k = a_k$
  - For $j = 0, \ldots, k - 1$:
    - Compute projection: $\operatorname{proj}_{u_j}(a_k) = \frac{\langle a_k, u_j \rangle}{\langle u_j, u_j \rangle} u_j$
    - Subtract projection: $u_k = u_k - \operatorname{proj}_{u_j}(a_k)$
  - Normalize: $e_k = \frac{u_k}{\|u_k\|}$
- Return $e_0, e_1, \ldots, e_n$
This orthogonalization method is relevant because $Q = \begin{bmatrix}e_0 \cdots e_n\end{bmatrix}$ (the orthonormal vectors) and $R$ is the upper-triangular matrix of projection coefficients, $r_{jk} = \langle e_j, a_k \rangle$.
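Below is a minimal serial sketch of the procedure above (illustrative only; the name `gram_schmidt` and the vector-of-columns layout are my choices, not the repo's):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// Classical Gram-Schmidt: orthonormalizes linearly independent columns a[0..n].
// Returns the orthonormal basis e[0..n]; R could be accumulated from the inner
// products if the full thin-QR factors were needed.
std::vector<Vec> gram_schmidt(const std::vector<Vec>& a) {
    std::vector<Vec> e;
    for (const Vec& ak : a) {
        Vec u = ak;
        // Subtract the projection of a_k onto every previously computed e_j.
        for (const Vec& ej : e) {
            double dot = 0.0;
            for (std::size_t i = 0; i < u.size(); ++i) dot += ak[i] * ej[i];
            for (std::size_t i = 0; i < u.size(); ++i) u[i] -= dot * ej[i];
        }
        // Normalize the remainder to get the next orthonormal basis vector.
        double norm = 0.0;
        for (double x : u) norm += x * x;
        norm = std::sqrt(norm);
        for (double& x : u) x /= norm;
        e.push_back(u);
    }
    return e;
}
```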
This will remain relatively high-level for brevity.
Decomposes a large computation into trivial tasks at an individual (or small block) level.
If I do a matrix multiplication, the task size might be computing a single entry of the output matrix. Communication cost is high relative to the work, since threads must frequently synchronize and exchange data.
Read more about how NVIDIA is taking advantage of shared memory here.
The work-to-communication ratio is very small, implying each processor performs few computations.
Decomposes a large computation into medium to large-sized tasks.
Following the matrix multiplication example, each task might now compute an entire block of rows of the output matrix. Communication cost is low relative to the work, since threads only synchronize occasionally (e.g., when combining their results).
Work-to-communication ratio is relatively large, implying each processor performs MANY computations.
Coarse-grained parallelization is better suited for problems limited by synchronization and communication latency such as in distributed databases where data is partitioned across nodes or in graph algorithms whose workload per edge/vertex varies greatly.
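A toy OpenMP sketch of the coarse-grained pattern used throughout this project: each thread owns one contiguous chunk of rows and does a lot of work per synchronization point (illustrative only, not code from the repo):

```cpp
#include <omp.h>
#include <cstddef>
#include <vector>

// Computes y = A * x for a row-major (rows x cols) matrix A.
// Each thread owns a contiguous block of rows, so the only "communication"
// is the implicit barrier at the end of the parallel for.
std::vector<double> matvec_coarse(const std::vector<double>& A,
                                  const std::vector<double>& x,
                                  int rows, int cols) {
    std::vector<double> y(rows, 0.0);
    // schedule(static) hands each thread one large contiguous chunk of rows.
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < rows; ++i) {
        double sum = 0.0;
        for (int j = 0; j < cols; ++j)
            sum += A[static_cast<std::size_t>(i) * cols + j] * x[j];
        y[i] = sum;  // no sharing: each thread writes disjoint entries of y
    }
    return y;
}
```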
Strong scaling measures how execution time decreases as the number of processors increases while the total problem size stays fixed.
See Amdahl's law for more details. Hiring more people to paint a fence speeds it up, but adding too many does not since the work cannot be divided infinitely.
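For reference, Amdahl's law bounds the strong-scaling speedup on $P$ processors, where $p$ is the parallelizable fraction of the work:

$$S(P) = \frac{1}{(1 - p) + \frac{p}{P}} \;\le\; \frac{1}{1 - p}.$$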
Weak scaling measures how execution time changes as the number of processors increases while the problem size per processor stays fixed. Gustafson's law gives the scaled speedup $S(P) = s + p \cdot P$, where $P$ is the number of processors, $s$ is the serial fraction of the work, and $p = 1 - s$ is the parallelizable fraction.
Informally, weak scaling and Gustafson's law explain that increasing the problem size and the number of processors results in near-linear speedups. Instead of painting a fence faster, paint a longer fence in the same amount of time by hiring more people.
See this link for the full pseudocode; it is not ALL rewritten here for brevity.
Suppose $A \in \mathbb{R}^{m \times n}$ is tall and skinny ($m \gg n$) with full column rank.
Algorithm: CholeskyQR
Input: Matrix $A \in \mathbb{R}^{m \times n}$
Output: Matrices $Q \in \mathbb{R}^{m \times n}$ and $R \in \mathbb{R}^{n \times n}$ such that $A = QR$

- Construct Gram Matrix: $W = A^\top A$
- Cholesky Factorization: $W = R^\top R$
- Compute $Q$: $Q = A R^{-1}$
- Return $Q, R$
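A compact serial sketch of these three steps in plain C++ (row-major `std::vector` storage; my own illustrative code, not the repo's implementation):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using Mat = std::vector<double>;  // row-major, dimensions passed separately

// CholeskyQR for a tall-and-skinny m x n matrix A (full column rank assumed).
// On return, Q is m x n with (approximately) orthonormal columns and R is
// n x n upper triangular with A = Q * R.
void cholesky_qr(const Mat& A, int m, int n, Mat& Q, Mat& R) {
    // 1) Gram matrix W = A^T A  (n x n, symmetric)
    Mat W(static_cast<std::size_t>(n) * n, 0.0);
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j)
            for (int k = 0; k < n; ++k)
                W[j * n + k] += A[i * n + j] * A[i * n + k];

    // 2) Cholesky factorization W = R^T R (R upper triangular)
    R.assign(static_cast<std::size_t>(n) * n, 0.0);
    for (int j = 0; j < n; ++j) {
        double d = W[j * n + j];
        for (int k = 0; k < j; ++k) d -= R[k * n + j] * R[k * n + j];
        R[j * n + j] = std::sqrt(d);
        for (int i = j + 1; i < n; ++i) {
            double s = W[j * n + i];
            for (int k = 0; k < j; ++k) s -= R[k * n + j] * R[k * n + i];
            R[j * n + i] = s / R[j * n + j];
        }
    }

    // 3) Q = A * R^{-1}, computed row by row: solve q_i R = a_i by
    //    forward substitution (equivalently R^T q_i^T = a_i^T).
    Q.assign(static_cast<std::size_t>(m) * n, 0.0);
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            double s = A[i * n + j];
            for (int k = 0; k < j; ++k) s -= Q[i * n + k] * R[k * n + j];
            Q[i * n + j] = s / R[j * n + j];
        }
    }
}
```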
Algorithm: ParallelCholeskyQR
Input: Matrix $A \in \mathbb{R}^{\text{rows} \times \text{cols}}$
Output: Matrices $Q$ and $R$ such that $A = QR$

- Initialize Variables
  - $W$ as a zero matrix of size $\text{cols} \times \text{cols}$
  - $W_{\text{local}}$ as an array of zero matrices of size $\text{cols} \times \text{cols}$, one for each thread
  - $\text{chunk size} \gets \lfloor \text{rows} / \text{threads} \rfloor$
- Compute Gram Matrix in Parallel
  - Parallel for each $\text{thread id}$:
    - $\text{start} \gets \text{thread id} \times \text{chunk size}$
    - $\text{end} \gets \begin{cases} \text{rows}, & \text{if thread id} = \text{threads} - 1 \\ \text{start} + \text{chunk size}, & \text{otherwise} \end{cases}$
    - $W_{\text{local}}[\text{thread id}] \gets A[\text{start}:\text{end}]^\top A[\text{start}:\text{end}]$
    - Critical Section: $W \gets W + W_{\text{local}}[\text{thread id}]$
- Perform Cholesky Factorization
  - $W = R^\top R$
- Compute $Q$ in Parallel
  - Parallel for each $\text{thread id}$:
    - Compute $\text{start}$ and $\text{end}$ as above
    - $Q[\text{start}:\text{end}] \gets A[\text{start}:\text{end}]\, R^{-1}$
- Return $Q, R$
The Gram matrix is computed in parallel by slicing $A$ row-wise into one chunk per thread: each thread computes its local Gram matrix $A[\text{start}:\text{end}]^\top A[\text{start}:\text{end}]$ independently, and the local results are summed into $W$ inside a critical section. Because $W$ is only $\text{cols} \times \text{cols}$, this reduction is cheap relative to the per-thread work, which is what makes the approach coarse-grained.
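A sketch of that parallel Gram-matrix step with OpenMP (my own illustrative code following the pseudocode above, not the repo's exact implementation):

```cpp
#include <omp.h>
#include <cstddef>
#include <vector>

// Computes W = A^T A for a row-major (rows x cols) matrix A by giving each
// thread one contiguous row chunk, accumulating per-thread partial Gram
// matrices, and summing them into W inside a critical section.
std::vector<double> parallel_gram(const std::vector<double>& A, int rows, int cols) {
    std::vector<double> W(static_cast<std::size_t>(cols) * cols, 0.0);

    #pragma omp parallel
    {
        const int threads = omp_get_num_threads();
        const int id = omp_get_thread_num();
        const int chunk = rows / threads;
        const int start = id * chunk;
        const int end = (id == threads - 1) ? rows : start + chunk;

        // Thread-local partial Gram matrix for this row chunk.
        std::vector<double> W_local(static_cast<std::size_t>(cols) * cols, 0.0);
        for (int i = start; i < end; ++i)
            for (int j = 0; j < cols; ++j)
                for (int k = j; k < cols; ++k)  // symmetric: fill upper triangle
                    W_local[j * cols + k] += A[i * cols + j] * A[i * cols + k];

        // One thread at a time adds its partial result into the shared W.
        #pragma omp critical
        for (std::size_t idx = 0; idx < W.size(); ++idx)
            W[idx] += W_local[idx];
    }

    // Mirror the upper triangle into the lower triangle.
    for (int j = 0; j < cols; ++j)
        for (int k = j + 1; k < cols; ++k)
            W[k * cols + j] = W[j * cols + k];
    return W;
}
```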
Algorithm: CQR2
Input: Matrix $A \in \mathbb{R}^{m \times n}$
Output: Matrices $Q$ and $R$ such that $A = QR$

- First QR Decomposition: $[Q_1, R_1] \gets \text{CholeskyQR}(A)$
- Second QR Decomposition for Accuracy: $[Q, R_2] \gets \text{CholeskyQR}(Q_1)$
- Compute Final $R$: $R \gets R_2 R_1$
- Return $Q, R$
CQR can produce non-orthogonal vectors, becoming unstable as the condition number increases. Repeating the orthogonalization improves stability, as detailed here. The orthogonality error $\|Q^\top Q - I\|$ of a single CQR pass scales with $\kappa(A)^2 u$, so running CQR twice brings it down to roughly machine precision as long as $A$ is not too ill-conditioned (roughly $\kappa(A) \lesssim u^{-1/2}$).
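Given any `cholesky_qr` routine like the sketch earlier, CQR2 is a thin wrapper plus one triangular matrix product (again illustrative; the function names are mine):

```cpp
#include <cstddef>
#include <vector>

using Mat = std::vector<double>;  // row-major

// Declared in the earlier sketch: thin QR of a tall m x n matrix A.
void cholesky_qr(const Mat& A, int m, int n, Mat& Q, Mat& R);

// CQR2: run CholeskyQR twice and combine the triangular factors, R = R2 * R1.
void cholesky_qr2(const Mat& A, int m, int n, Mat& Q, Mat& R) {
    Mat Q1, R1, R2;
    cholesky_qr(A, m, n, Q1, R1);   // first pass:  A  = Q1 * R1
    cholesky_qr(Q1, m, n, Q, R2);   // second pass: Q1 = Q * R2  (re-orthogonalize)

    // R = R2 * R1; both factors are n x n upper triangular, so the product is too.
    R.assign(static_cast<std::size_t>(n) * n, 0.0);
    for (int i = 0; i < n; ++i)
        for (int k = i; k < n; ++k)        // R2[i][k] is zero below the diagonal
            for (int j = k; j < n; ++j)    // R1[k][j] is zero below the diagonal
                R[i * n + j] += R2[i * n + k] * R1[k * n + j];
}
```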
From here I will ONLY be giving a brief explanation of each algorithm. See the paper for pseudocode.
A shift is added to the Gram matrix before the Cholesky factorization ($W = A^\top A + sI$ for a small $s > 0$), which keeps the factorization numerically positive definite when $A$ is ill-conditioned, at the cost of a slightly less orthogonal $Q$ per pass.
This is essentially CQR2 with an additional step: sCQR is first applied as a preconditioner, and the resulting matrix is then passed through CQR2, which achieves further orthogonalization.
Similar to CQR but with block (panel) processing and a Gram-Schmidt panel update/reorthogonalization step before computing the final $Q$ and $R$ factors.
CQR2 with CQRGS instead of CQR, leveraging parallel block processing and Gram-Schmidt reorthogonalization. It improves stability, efficiency, and accuracy while optimizing computational cost.
mCQR2GS restructures CQRGS to reduce the number of panels while maintaining computational and communication efficiency. It adaptively selects the paneling strategy based on matrix conditioning, ensuring stability with fewer operations. Compared to CQR2GS, mCQR2GS requires fewer floating-point operations by avoiding explicit factor construction and achieves better orthogonality with fewer panels for high-condition-number matrices.
This project served as a hands-on exploration of parallel QR decomposition using OpenMP, blending high-performance computing with numerical linear algebra.
While the concepts are not novel, implementing them deepened my understanding of parallelization and provided a practical refresh on QR decomposition. The algorithms showcased here accelerate least-squares regression and eigenvalue computations, making large-scale data analysis more efficient. This lays the groundwork for a more advanced project I will begin soon relating to Asian options.
I used the Intel VTune flame graph for performance analysis and stack trace inspection.
I burnt out writing the documentation and realized I want to spend more time writing code instead of these words that people won't read. For this reason, I will likely start contributing to some kind of relevant open-source project related to numerical/parallel computing moving forward because writing documentation is time-intensive and (usually) not very exciting. I also want to spend more time watching "The Walking Dead" after work. I watched through season 7 a few years ago but never saw the whole show. I also want to watch the spinoffs. Realistically, I'll start another project while watching it because I want to learn more things, advance my career, and all that other stuff, but I thought I would let you know where I'm at.