Parallel-QR-Decomposition

Coarse-grained parallel thin-QR decomposition algorithms for tall-and-skinny matrices on a CPU, written in C++ with OpenMP.

Precursor

Why I did this project:

I wanted to learn more about parallelizing programs for high-performance numerical computing on CPUs. The authors' GitHub repository that these algorithms are originally based on contains heterogeneous versions that are relatively difficult to understand without first reading the pseudocode. This project let me mesh my math and HPC interests: learn parallelization in practice, learn OpenMP, and refresh my QR decomposition math while providing some (more) user-friendly code.

The concepts and insights of this project are not novel, but I wanted to implement numerical algorithms from the literature as a "warm-up" to a very interesting project that I will begin working on soon (and to show my C++ competence). This is a hint at said project's topic.

What this project is:

A C++ implementation of novel parallel QR decomposition algorithms from this paper. I set out to implement the GPU-limited algorithms from its repository as well (EDIT AFTER IMPLEMENTATION: the GPU-limited algorithms were VERY slow on a CPU, as they were meant for GPUs).

Start by reading this paper for background, then continue with the rest of this README.

Why you should care:

Parallel algorithms like these significantly speed up least-squares regression and eigenvalue computations for PCA, among other relevant applications. Basically, data scientists will waste less time waiting for models to finish training and can iterate/improve solutions faster.

What this means for business people who don't care about any of that high-performance-numerical-computing-math stuff: the computer is faster and your engineers can make you more money.

File Navigation

exploration: I tinker with OpenMP to make sure it works on my PC.

utils: Contains the master helper file with all algorithm implementations (some algorithms are helper functions; this is convenient for testing).

tests: Speed and orthogonality error tests. The raw .csv for the speed tests (in seconds) is in the /data path.*

The remaining folders are named after their algorithm and contain a .cpp file with the respective implementation.

*I know it is bad practice to put the .csv in a Github repo, but its size is negligible and the raw data provides relevant insights.

Background

Skip if you have an undergraduate-level understanding of numerical computing and parallelism.

Tall-and-Skinny Matrices

A matrix $A \in \mathbb{R}^{m \times n}$ is tall and skinny when $m \gg n$.

The most common occurrence of these is design matrices for machine learning where the number of data points is much larger than the number of features. Other common applications of tall-and-skinny matrices are Fourier transforms in sensors, finite element methods, and the Jacobian matrices for iterative optimization algorithms.

A matrix is "short-and-wide" when m n . Short-and-wide problems are analogous to tall-and-skinny problems under the transpose operation.

Machine Epsilon or Machine Precision

Upper bound on approximation error for floating point computations. Unit roundoff is a synonym. It is denoted by $\epsilon_{\text{machine}}$.

Given $b$-based arithmetic and $t$-digit precision, $\epsilon_{\text{machine}} = b^{1-t}$. Common standards are IEEE 754 single precision, where $b = 2$ and $t = 24$ ($\epsilon_{\text{machine}} \approx 1.1920929 \times 10^{-7}$), and double precision, where $b = 2$ and $t = 53$ ($\epsilon_{\text{machine}} \approx 2.22044605 \times 10^{-16}$).

Informally, this is the gap between $1$ and the next larger representable floating point number.
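A quick way to check these values, as a minimal sketch using only the C++ standard library (independent of this repository's code):

```cpp
// Print machine epsilon for single and double precision.
#include <iostream>
#include <limits>

int main() {
    // std::numeric_limits<T>::epsilon() returns b^(1-t): the gap between 1 and the next value of type T.
    std::cout << "single precision: " << std::numeric_limits<float>::epsilon()  << "\n"   // ~1.19209e-07
              << "double precision: " << std::numeric_limits<double>::epsilon() << "\n";  // ~2.22045e-16
}
```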

Condition Number

Measures the sensitivity of a problem's solution to perturbations in its input. It is typically denoted by $\kappa$; the condition number of a matrix $A$ is denoted by $\kappa(A)$.

Suppose $f : X \to Y$ is the exact problem, $x \in X$ is an input, and $\delta f(x) = f(x + \delta x) - f(x)$ is the change in the output caused by a small perturbation $\delta x$ of the input.

Absolute condition number: $\hat{\kappa} = \lim_{\epsilon \to 0^+} \left[ \sup_{\|\delta x\| \le \epsilon} \frac{\|\delta f(x)\|}{\|\delta x\|} \right]$.

Relative condition number: $\kappa = \lim_{\epsilon \to 0^+} \left[ \sup_{\|\delta x\| \le \epsilon} \frac{\|\delta f(x)\| / \|f(x)\|}{\|\delta x\| / \|x\|} \right]$.

When using the $L_2$ norm, $\kappa(A) = \frac{\sigma_{\max}(A)}{\sigma_{\min}(A)}$, where $\sigma_{\max}$ and $\sigma_{\min}$ are the maximum and minimum singular values of $A$, respectively.

If a condition number is near $1$, the problem is called well-conditioned. If the condition number is very large, it is ill-conditioned. Examples of ill-conditioned matrices are Hilbert matrices, Vandermonde matrices with densely clustered nodes, and matrices with nearly linearly dependent rows/columns.
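As a small illustration of the $L_2$ formula above, here is a hedged sketch using Eigen's SVD (Eigen is assumed here for illustration only, not necessarily what this repository uses for this purpose):

```cpp
// Estimate kappa(A) = sigma_max / sigma_min from the singular values.
#include <Eigen/Dense>
#include <iostream>

int main() {
    Eigen::MatrixXd A = Eigen::MatrixXd::Random(1000, 5);           // tall-and-skinny example
    Eigen::JacobiSVD<Eigen::MatrixXd> svd(A);                       // singular values only
    const Eigen::VectorXd& s = svd.singularValues();                // sorted in decreasing order
    std::cout << "kappa(A) = " << s(0) / s(s.size() - 1) << "\n";   // sigma_max / sigma_min
}
```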

QR Decomposition

A matrix decomposition method known primarily for solving linear least squares via back-substitution ($A = QR$ and $Ax = b \Rightarrow QRx = b \Rightarrow Rx = Q^T b$ because $Q$ has orthonormal columns).

$A \in \mathbb{R}^{m \times n} \Rightarrow Q \in \mathbb{R}^{m \times n}$ and $R \in \mathbb{R}^{n \times n}$.

$Q$ has orthonormal columns and $R$ is upper-triangular.

A QR decomposition is "thin" when m > n . A thin decomposition follows as such: $A = QR = \begin{bmatrix} Q_1 & Q_2\end{bmatrix}\begin{bmatrix}R_1 \ 0\end{bmatrix} = Q_1R_1$.

QR decomposition is preferred to the normal equations for solving least-squares problems because the normal equations square the condition number and may lead to significant rounding errors when $A$ is nearly singular. Solving via the normal equations has an effective condition number of $\kappa(A)^2$, while the error of the QR-based solve scales as $O(\kappa(A)\,\epsilon_{\text{machine}})$.
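To make the back-substitution solve $Rx = Q^T b$ concrete, here is a hedged sketch of a least-squares solve via thin QR using Eigen's HouseholderQR (illustrative only, not this repository's Cholesky-based routines):

```cpp
// Solve min ||Ax - b|| via A = Q1 R1 and back-substitution R1 x = Q1^T b.
#include <Eigen/Dense>
#include <iostream>

int main() {
    const int m = 1000, n = 5;
    Eigen::MatrixXd A = Eigen::MatrixXd::Random(m, n);   // tall-and-skinny design matrix
    Eigen::VectorXd b = Eigen::VectorXd::Random(m);

    Eigen::HouseholderQR<Eigen::MatrixXd> qr(A);
    Eigen::MatrixXd Q1 = qr.householderQ() * Eigen::MatrixXd::Identity(m, n);        // thin Q (m x n)
    Eigen::MatrixXd R1 = qr.matrixQR().topRows(n).triangularView<Eigen::Upper>();     // R1 (n x n)

    Eigen::VectorXd x = R1.triangularView<Eigen::Upper>().solve(Q1.transpose() * b);  // back-substitution
    std::cout << "residual norm: " << (A * x - b).norm() << "\n";
}
```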

There are various other special properties of the matrices in a QR decomposition, and variants of the algorithm that improve the condition number and computation speed. I implore you to discover them for yourself; this documentation is a very basic crash course.

Gram-Schmidt Orthogonalization

The Gram-Schmidt algorithm finds an orthonormal basis for a set of linearly independent vectors.

Algorithm: Gram-Schmidt Orthogonalization

Input: Linearly independent vectors $v_0, \ldots, v_n$ in an inner product space.

Output: Orthonormal basis $e_0, \ldots, e_n$.

  1. Initialize:

    • Set $u_0 = v_0$
    • Normalize: $e_0 = \frac{u_0}{\|u_0\|}$
  2. For $i = 1, \ldots, n$:

    • Set $u_i = v_i$
    • For $j = 0, \ldots, i - 1$:
      • Compute projection: $\text{proj}_{u_j}(v_i) = \frac{\langle v_i, u_j \rangle}{\langle u_j, u_j \rangle} u_j$
      • Subtract projection: $u_i = u_i - \text{proj}_{u_j}(v_i)$
    • Normalize: $e_i = \frac{u_i}{\|u_i\|}$
  3. Return $e_0, \ldots, e_n$

This orthogonalization method is relevant because $Q = \begin{bmatrix} e_0 & \cdots & e_n \end{bmatrix}$ (the orthonormal vectors) and $R_{i,j} = \langle v_j, e_i \rangle,\ 0 \le i \le j \le n$ (the projection coefficients). One may deduce that $R_{i,i} = \|u_i\|$.
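A direct serial translation of the algorithm above into C++ with Eigen, as a sketch (the helper in utils may differ; classical Gram-Schmidt is shown, not the more stable modified variant):

```cpp
// Classical Gram-Schmidt: the columns of A are orthonormalized into Q, coefficients stored in R.
#include <Eigen/Dense>

void gram_schmidt(const Eigen::MatrixXd& A, Eigen::MatrixXd& Q, Eigen::MatrixXd& R) {
    const int n = static_cast<int>(A.cols());
    Q = A;                                       // u_i starts as v_i
    R = Eigen::MatrixXd::Zero(n, n);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < i; ++j) {
            R(j, i) = Q.col(j).dot(A.col(i));    // projection coefficient <v_i, e_j>
            Q.col(i) -= R(j, i) * Q.col(j);      // subtract the projection onto e_j
        }
        R(i, i) = Q.col(i).norm();               // R_{i,i} = |u_i|
        Q.col(i) /= R(i, i);                     // normalize to get e_i
    }
}
```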

Fine vs Coarse-Grained Parallelization

This will remain relatively high-level for brevity.

Fine-Grained Parallelization

Decomposes a large computation into trivial tasks at an individual (or small block) level.

If I do a matrix multiplication, the task $T$ for each processor may be $c_{i,j} = a_{i,k} b_{k,j}$ (relatively VERY small).

Communication cost $C_{\text{fine}}$ grows at a rate of $O(Pc)$ per parallel step, where $P$ is the number of processors and $c$ is the cost per communication. This communication rate is relatively high, meaning it is advantageous to have fast communication in the form of a shared-memory architecture.

Read more about how NVIDIA is taking advantage of shared memory here.

The work-to-communication ratio is very small, implying each processor performs few computations between communications.

Coarse-Grained Parallelization

Decomposes a large computation into medium to large-sized tasks.

Following the matrix multiplication example, $|T|$ may be $\vec{a}_i \vec{b}_j$ or $A_{I,K} B_{K,J}$, where $A_{I,K}$ and $B_{K,J}$ are matrix slices.

Communication cost $C_{\text{coarse}}$ is $O(Pc)$ in total, much lower than the accumulated communication cost of fine-grained parallelization.

The work-to-communication ratio is relatively large, implying each processor performs MANY computations between communications.

Coarse-grained parallelization is better suited for problems limited by synchronization and communication latency such as in distributed databases where data is partitioned across nodes or in graph algorithms whose workload per edge/vertex varies greatly.
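As a hedged sketch of coarse-grained parallelism in the matrix multiplication example (Eigen plus OpenMP, illustrative only and not this repository's code), each thread owns whole rows of the output:

```cpp
// Coarse-grained matrix multiplication: one large task (a full output row) per loop iteration.
#include <Eigen/Dense>
#include <omp.h>

Eigen::MatrixXd coarse_matmul(const Eigen::MatrixXd& A, const Eigen::MatrixXd& B) {
    Eigen::MatrixXd C(A.rows(), B.cols());
    #pragma omp parallel for schedule(static)             // static scheduling gives contiguous row blocks
    for (int i = 0; i < static_cast<int>(A.rows()); ++i) {
        C.row(i) = A.row(i) * B;                          // each thread writes only its own rows
    }
    return C;
}
```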

Weak and Strong Scaling Speedup

Strong scaling measures how execution time decreases as the number of processors $P$ increases. The strong speedup is the ratio of the time taken to complete a task with one processor, $t(1)$, to the time taken for the same task with $P$ processors, $t(P)$.

$\text{speedup}_{\text{strong}} = \frac{t(1)}{t(P)}$

See Amdahl's law for more details. Hiring more people to paint a fence speeds it up, but adding too many does not since the work cannot be divided infinitely.

Weak scaling measures how execution time changes as P increases while task-size per processor is constant. The weak speedup is...

$\text{speedup}_{\text{weak}} = s + pP$

where s is the proportion of execution time spent computing serial tasks and p is that of the parallel tasks.

Informally, weak scaling and Gustafson's law explain that increasing the problem size and the number of processors results in near-linear speedups. Instead of painting a fence faster, paint a longer fence in the same amount of time by hiring more people.
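A quick worked example with illustrative numbers (not measurements from this repository): take $s = 0.1$, $p = 0.9$, and $P = 8$ processors. Gustafson's weak speedup is $0.1 + 0.9 \times 8 = 7.3$, while Amdahl's strong-scaling bound is $\frac{1}{0.1 + 0.9/8} \approx 4.7$, showing why growing the problem alongside the processor count scales better than shrinking a fixed problem's runtime.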

Algorithms

See this link for the full pseudocode; it is not ALL rewritten here for brevity.

Suppose $p$ is the number of processors.

Cholesky QR (CQR)

Algorithm: CholeskyQR

Input: Matrix $A \in \mathbb{R}^{m \times n}$.

Output: Matrices $Q \in \mathbb{R}^{m \times n}$ and $R \in \mathbb{R}^{n \times n}$.

  1. Construct Gram Matrix:

    • $W = A^T A$
  2. Cholesky Factorization:

    • $W = R^T R$
  3. Compute $Q$:

    • $Q = A R^{-1}$
  4. Return $(Q, R)$
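A serial sketch of these four steps using Eigen (a hypothetical helper for illustration, not the repository's implementation; the parallel version in the next subsection distributes the Gram-matrix step across threads):

```cpp
// CholeskyQR sketch; assumes A has full column rank so the Gram matrix is positive definite.
#include <Eigen/Dense>
#include <utility>

std::pair<Eigen::MatrixXd, Eigen::MatrixXd> cholesky_qr(const Eigen::MatrixXd& A) {
    Eigen::MatrixXd W = A.transpose() * A;          // 1. Gram matrix (n x n)
    Eigen::MatrixXd R = W.llt().matrixU();          // 2. Cholesky: W = R^T R with R upper-triangular
    Eigen::MatrixXd Qt = R.transpose()
                          .triangularView<Eigen::Lower>()
                          .solve(A.transpose());    // 3. Solve R^T Q^T = A^T, i.e. Q = A R^{-1}
    return {Qt.transpose(), R};                     // 4. Return (Q, R)
}
```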

Parallel Cholesky QR

Algorithm: ParallelCholeskyQR

Input: Matrix $A \in \mathbb{R}^{m \times n}$.

Output: Matrices $Q \in \mathbb{R}^{m \times n}$ and $R \in \mathbb{R}^{n \times n}$.

  1. Initialize Variables

    • $\text{rows} \gets \text{rows}(A)$
    • $\text{cols} \gets \text{cols}(A)$
    • $\text{threads} \gets \text{max-threads}()$
    • $W$ as an $n \times n$ zero matrix
    • $\text{local } W$ as an array of $n \times n$ zero matrices, one for each thread
  2. Compute Gram Matrix in Parallel

    • Parallel for each $\text{thread id} \in [0, \ldots, \text{threads} - 1]$:
      1. $\text{chunk size} \gets \lfloor \text{rows} / \text{threads} \rfloor$
      2. $\text{start} \gets \text{thread id} \times \text{chunk size}$
      3. $\text{end} \gets \begin{cases} \text{rows}, & \text{if thread id} = \text{threads} - 1 \\ \text{start} + \text{chunk size}, & \text{otherwise} \end{cases}$
      4. $A_i \gets A[\text{start} : \text{end}]$
      5. $\text{local } W[\text{thread id}] \gets A_i^T A_i$
      6. Critical Section: $W \gets W + \text{local } W[\text{thread id}]$
      7. $Q[\text{start} : \text{end}] \gets A_i$
  3. Perform Cholesky Factorization

    • $W = R^T R$
  4. Compute $Q$ in Parallel

    • Parallel for each $\text{thread id} \in [0, \ldots, \text{threads} - 1]$:
      1. $Q[\text{start} : \text{end}] \gets Q[\text{start} : \text{end}] \times R^{-1}$
  5. Return $(Q, R)$

The Gram matrix is computed in parallel by slicing $A$ into chunks, reducing the complexity of that step to $O\left(\frac{mn^2}{p}\right)$ from $O(mn^2)$. Synchronization overhead from updating $W$ is negligible. Computing the final $Q$ in parallel scales similarly.
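Step 2 above maps naturally onto OpenMP. Here is a hedged sketch of just the parallel Gram-matrix accumulation, using Eigen for the local products (function and variable names are illustrative, not the repository's API):

```cpp
// Parallel Gram matrix: each thread computes A_i^T A_i on its row chunk, then accumulates into W.
#include <Eigen/Dense>
#include <omp.h>

Eigen::MatrixXd parallel_gram(const Eigen::MatrixXd& A) {
    const int n = static_cast<int>(A.cols());
    Eigen::MatrixXd W = Eigen::MatrixXd::Zero(n, n);

    #pragma omp parallel
    {
        const int threads = omp_get_num_threads();
        const int id      = omp_get_thread_num();
        const int rows    = static_cast<int>(A.rows());
        const int chunk   = rows / threads;
        const int start   = id * chunk;
        const int end     = (id == threads - 1) ? rows : start + chunk;

        // Local Gram matrix of this thread's row block.
        Eigen::MatrixXd localW = A.middleRows(start, end - start).transpose()
                               * A.middleRows(start, end - start);

        // Accumulate into the shared result; the critical section keeps the update race-free.
        #pragma omp critical
        W += localW;
    }
    return W;
}
```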

Cholesky QR 2 (CQR2)

Algorithm: CQR2

Input: Matrix $A \in \mathbb{R}^{m \times n}$.

Output: Matrices $Q \in \mathbb{R}^{m \times n}$ and $R \in \mathbb{R}^{n \times n}$.

  1. First QR Decomposition

    • $[Q_1, R_1] \gets \text{ParallelCholeskyQR}(A)$
  2. Second QR Decomposition for Accuracy

    • $[Q, R_2] \gets \text{ParallelCholeskyQR}(Q_1)$
  3. Compute Final $R$

    • $R \gets R_2 R_1$
  4. Return $(Q, R)$
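A sketch of CQR2 in terms of the cholesky_qr helper sketched earlier (hypothetical names; the repository uses its parallel version instead):

```cpp
// CQR2: run CholeskyQR twice and combine the triangular factors.
#include <Eigen/Dense>
#include <utility>

std::pair<Eigen::MatrixXd, Eigen::MatrixXd> cholesky_qr(const Eigen::MatrixXd& A); // from the CQR sketch

std::pair<Eigen::MatrixXd, Eigen::MatrixXd> cqr2(const Eigen::MatrixXd& A) {
    auto [Q1, R1] = cholesky_qr(A);    // 1. first pass
    auto [Q,  R2] = cholesky_qr(Q1);   // 2. re-orthogonalize Q1
    return {Q, R2 * R1};               // 3. R = R2 * R1
}
```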

CQR can produce vectors that are far from orthogonal, becoming unstable as the condition number increases. Repeating the orthogonalization improves stability, as detailed here. CQR's orthogonality error scales as $O(\kappa(A)^2 \mathbf{u})$, where $\mathbf{u}$ is the unit roundoff.

Shifted Cholesky QR (sCQR)

From here I will ONLY be giving a brief explanation of each algorithm. See the paper for pseudocode.

A shift $s = m\,\mathbf{u}\,\|A\|_F^2$ is applied to the diagonal of the Gram matrix to force it to be positive definite. The rest of the steps follow CQR.
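For concreteness, the shift step could look like the following Eigen sketch (an illustration that assumes the unit roundoff $\mathbf{u}$ is taken as machine epsilon; not the repository's exact code):

```cpp
// sCQR shift: add s = m * u * ||A||_F^2 to the diagonal of the Gram matrix W before Cholesky.
#include <Eigen/Dense>
#include <limits>

void apply_shift(Eigen::MatrixXd& W, const Eigen::MatrixXd& A) {
    const double u = std::numeric_limits<double>::epsilon();                // unit roundoff (approximation)
    const double s = static_cast<double>(A.rows()) * u * A.squaredNorm();   // squaredNorm() == ||A||_F^2
    W.diagonal().array() += s;                                              // shift forces positive definiteness
}
```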

Shifted Cholesky QR 3 (sCQR3)

This is essentially CQR2 preceded by a shifted step: sCQR acts as a preconditioner, and CQR2 is then applied to its output, which achieves further orthogonalization.

Cholesky QR with Gram-Schmidt (CQRGS)

Similar to CQR but with block processing and panel update/reorthogonalization before computing the final R .

Cholesky QR2 with Gram-Schmidt (CQR2GS)

CQR2 with CQRGS instead of CQR, leveraging parallel block processing and Gram-Schmidt reorthogonalization. It improves stability, efficiency, and accuracy while optimizing computational cost.

Modified Cholesky QR2 with Gram-Schmidt (mCQR2GS)

mCQR2GS restructures CQRGS to reduce the number of panels while maintaining computational and communication efficiency. It adaptively selects the paneling strategy based on matrix conditioning, ensuring stability with fewer operations. Compared to CQR2GS, mCQR2GS requires fewer floating-point operations by avoiding explicit factor construction and achieves better orthogonality with fewer panels for high-condition-number matrices.

Conclusion

This project served as a hands-on exploration of parallel QR decomposition using OpenMP, blending high-performance computing with numerical linear algebra.

While the concepts are not novel, implementing them deepened my understanding of parallelization and provided a practical refresh on QR decomposition. The algorithms showcased here accelerate least-squares regression and eigenvalue computations, making large-scale data analysis more efficient. This lays the groundwork for a more advanced project I will begin soon relating to Asian options.

Side Note

I used the Intel VTune flame graph for performance analysis and stack trace inspection.

I burnt out writing the documentation and realized I want to spend more time writing code instead of these words that people won't read. For this reason, I will likely start contributing to some kind of relevant open-source project related to numerical/parallel computing moving forward because writing documentation is time-intensive and (usually) not very exciting. I also want to spend more time watching "The Walking Dead" after work. I watched through season 7 a few years ago but never saw the whole show. I also want to watch the spinoffs. Realistically, I'll start another project while watching it because I want to learn more things, advance my career, and all that other stuff, but I thought I would let you know where I'm at.