Fix #211 Optimize reduce-scatter for Q matrix #212
Merged
TL;DR In our benchmark (see below), reduce-scatter became ~50x faster. Thanks to that, SDPB became ~2x faster on big problems and scales much better for a large number of nodes.
The old SDPB version used a ring algorithm among all ranks to accumulate local contributions to the Q matrix and write them to the global DistMatrix. The ring algorithm consists of `num_ranks - 1` iterations, and on each iteration a rank sends `~ #(Q) / num_ranks` matrix elements to another rank. That made sense before we started using a shared memory window for calculating Q, see #142.
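For orientation, a generic ring-style reduce-scatter looks roughly like this (a minimal sketch with `double` chunks; the actual SDPB code operates on its own matrix types and buffer layout):

```cpp
// Ring reduce-scatter (minimal sketch, not the actual SDPB code).
// Each rank holds a local contribution split into num_ranks chunks;
// after num_ranks - 1 iterations every rank owns one fully reduced chunk.
#include <mpi.h>
#include <vector>

void ring_reduce_scatter(std::vector<double> &chunks, int chunk_size,
                         MPI_Comm comm)
{
  int rank, num_ranks;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &num_ranks);

  std::vector<double> recv_buf(chunk_size);
  const int dest = (rank + 1) % num_ranks;
  const int source = (rank - 1 + num_ranks) % num_ranks;
  for(int step = 1; step < num_ranks; ++step)
    {
      // Pass partially reduced chunks around the ring and accumulate.
      const int send_chunk = (rank - step + 1 + num_ranks) % num_ranks;
      const int recv_chunk = (rank - step + num_ranks) % num_ranks;
      MPI_Sendrecv(chunks.data() + send_chunk * chunk_size, chunk_size,
                   MPI_DOUBLE, dest, 0, recv_buf.data(), chunk_size,
                   MPI_DOUBLE, source, 0, comm, MPI_STATUS_IGNORE);
      for(int i = 0; i < chunk_size; ++i)
        chunks[recv_chunk * chunk_size + i] += recv_buf[i];
    }
}
```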
Now all ranks on a node have access to the shared memory window containing residues of Q_n (the contribution to Q from node n) modulo all primes. Therefore, no communication inside a node is required, and all we have to do is reduce Q_n from all nodes into the global Q.
If each node had a single rank, the implemented algorithm would work as follows: if a node owns some element `Q[i,j]`, then all other nodes send their `Q_n[i,j]` to that node. In addition, the node restores its own contribution from the residues.
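In other words, the owner of each element simply accumulates the per-node contributions:

$$Q[i,j] = \sum_{n=1}^{\mathrm{num\_nodes}} Q_n[i,j]$$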
Since each node usually has multiple ranks, we have to distribute the work among them. We use the following scheme:
If rank r owns a given element Q[i,j], then it receives contributions from the processes on other nodes that have the same rank within their node (we use the fact that all nodes have the same number of ranks). For example, for 3 nodes with 128 cores each, rank 0 communicates only with ranks 128 and 256, rank 1 with ranks 129 and 257, and so on. As a result, `MPI_Sendrecv` is called on the communicators `(0,128,256)`, `(1,129,257)`, etc. Each rank has send and receive buffers of size `~ #(Q) / num_ranks` and performs `(num_nodes - 1)` send/recv operations.
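A rough sketch of this communication pattern, assuming for simplicity that each rank's contributions are already packed into per-destination-node buffers of plain `double`s (the real implementation works with the residues stored in the shared memory window, and the names below are illustrative, not the actual SDPB source):

```cpp
// Sketch of the cross-node reduction of Q (illustration only).
// Assumptions: ranks are laid out node by node, every rank owns the same
// number of elements of Q, and contributions[m] already holds this node's
// contribution to the elements owned by the peer rank on node m.
#include <mpi.h>
#include <cstddef>
#include <vector>

void reduce_Q_across_nodes(std::vector<std::vector<double>> &contributions,
                           int ranks_per_node, MPI_Comm world)
{
  int rank, num_ranks;
  MPI_Comm_rank(world, &rank);
  MPI_Comm_size(world, &num_ranks);
  const int num_nodes = num_ranks / ranks_per_node;
  const int rank_in_node = rank % ranks_per_node;

  // Group ranks that have the same index within their node,
  // e.g. (0,128,256), (1,129,257), ... for 3 nodes with 128 ranks each.
  MPI_Comm comm;
  MPI_Comm_split(world, rank_in_node, rank, &comm);
  int node;
  MPI_Comm_rank(comm, &node); // index of our node within this communicator

  auto &owned = contributions[node]; // starts as our own contribution Q_n
  std::vector<double> recv_buf(owned.size());

  // (num_nodes - 1) pairwise exchanges: send our contribution to each owner,
  // receive the other nodes' contributions to the elements we own, accumulate.
  for(int offset = 1; offset < num_nodes; ++offset)
    {
      const int dest = (node + offset) % num_nodes;
      const int source = (node - offset + num_nodes) % num_nodes;
      auto &send_buf = contributions[dest];
      MPI_Sendrecv(send_buf.data(), static_cast<int>(send_buf.size()),
                   MPI_DOUBLE, dest, 0, recv_buf.data(),
                   static_cast<int>(recv_buf.size()), MPI_DOUBLE, source, 0,
                   comm, MPI_STATUS_IGNORE);
      for(std::size_t i = 0; i < recv_buf.size(); ++i)
        owned[i] += recv_buf[i];
    }
  MPI_Comm_free(&comm);
}
```

With this pairing, each communicator contains exactly `num_nodes` ranks, so each rank sends `(num_nodes - 1)` messages of size `~ #(Q) / num_ranks` instead of performing `(num_ranks - 1)` ring iterations.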