
Fix #211 Optimize reduce-scatter for Q matrix #212

Merged: 5 commits merged into master from reduce-scatter on Mar 16, 2024

Conversation

@vasdommes (Collaborator) commented Mar 16, 2024

TL;DR In our benchmark (see below), reduce-scatter became ~50x faster. Thanks to that, SDPB became ~2x faster on big problems and scales much better for a large number of nodes.

The old SDPB version used a ring algorithm among all ranks to accumulate local contributions to the Q matrix and write them to the global DistMatrix.
The ring algorithm consists of num_ranks - 1 iterations, and at each iteration each rank sends ~ #(Q) / num_ranks matrix elements to another rank.
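
For orientation, here is a minimal sketch of such a ring reduce-scatter. It is not the SDPB implementation: plain doubles stand in for BigFloat elements, and Q is assumed to be split into num_ranks equal contiguous slabs, one per rank.

```cpp
// Sketch of the old scheme (not the actual SDPB code): a ring reduce-scatter
// over all ranks. Q_local holds this rank's full contribution, split into
// num_ranks contiguous slabs of slab_size elements; after num_ranks - 1 steps,
// rank r holds the fully reduced slab (r + 1) % num_ranks.
#include <mpi.h>
#include <vector>

void ring_reduce_scatter(std::vector<double> &Q_local, int slab_size,
                         MPI_Comm comm)
{
  int rank, num_ranks;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &num_ranks);
  const int send_to = (rank + 1) % num_ranks;
  const int recv_from = (rank - 1 + num_ranks) % num_ranks;
  std::vector<double> recv_buf(slab_size);
  for(int step = 0; step < num_ranks - 1; ++step)
    {
      // Forward the partial sum for one slab to the next rank in the ring,
      // receive the partial sum for another slab from the previous rank.
      const int send_slab = (rank - step + num_ranks) % num_ranks;
      const int recv_slab = (rank - step - 1 + num_ranks) % num_ranks;
      MPI_Sendrecv(Q_local.data() + send_slab * slab_size, slab_size,
                   MPI_DOUBLE, send_to, 0, recv_buf.data(), slab_size,
                   MPI_DOUBLE, recv_from, 0, comm, MPI_STATUS_IGNORE);
      // Add our own contribution to the partial sum that just arrived,
      // so it keeps accumulating as it travels around the ring.
      for(int i = 0; i < slab_size; ++i)
        Q_local[recv_slab * slab_size + i] += recv_buf[i];
    }
}
```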

That made sense before we started using a shared memory window for calculating Q, see #142.
Now all ranks on a node have access to the shared memory window containing the residues of Q_n (the contribution to Q from node n) modulo all primes.

Therefore, no communication inside a node is required, and all we have to do is reduce the Q_n from all nodes into the global Q.
If a node owns some element Q[i,j], then all other nodes should send their Q_n[i,j] to that node. In addition, the node restores its own contribution from the residues.
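
The "restore from residues" step is essentially a Chinese Remainder Theorem reconstruction. Below is a toy sketch of that idea; the function names are hypothetical, and 64-bit integers with small primes stand in for the multiprecision residue arithmetic actually used in bigint_syrk.

```cpp
// Toy sketch of restoring a value from its residues modulo a set of primes
// (Chinese Remainder Theorem). In SDPB the residues live in a shared-memory
// window and the arithmetic is multiprecision; here 64-bit integers and
// small primes keep the example self-contained. Names are hypothetical.
#include <cstdint>
#include <vector>

// Modular inverse of a modulo prime m via the extended Euclidean algorithm.
int64_t mod_inverse(int64_t a, int64_t m)
{
  int64_t old_r = a % m, r = m, old_s = 1, s = 0;
  while(r != 0)
    {
      const int64_t q = old_r / r;
      int64_t tmp = old_r - q * r; old_r = r; r = tmp;
      tmp = old_s - q * s; old_s = s; s = tmp;
    }
  return ((old_s % m) + m) % m;
}

// Reconstruct x modulo the product of the primes from x mod primes[k].
int64_t restore_from_residues(const std::vector<int64_t> &residues,
                              const std::vector<int64_t> &primes)
{
  int64_t modulus = 1;
  for(const int64_t p : primes)
    modulus *= p;
  __int128 x = 0;
  for(size_t k = 0; k < primes.size(); ++k)
    {
      const int64_t m_k = modulus / primes[k];
      const int64_t inv = mod_inverse(m_k % primes[k], primes[k]);
      __int128 term = (__int128)(residues[k] % primes[k]) * m_k % modulus;
      term = term * inv % modulus;
      x = (x + term) % modulus;
    }
  return (int64_t)x;
}
// Example: restore_from_residues({3 % 5, 3 % 7, 3 % 11}, {5, 7, 11}) == 3.
```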

If each node has one rank, then the implemented algorithm works as follows:

for offset = 1..num_nodes-1:
  - Each node (n) restores elements Q_n[i,j] for all [i,j] owned by node (n+offset) from residues,
    and sends them to node (n+offset). Implemented via MPI_Sendrecv.
  - Each node updates its own Q[i,j] with data received from node (n-offset).
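
Under simplifying assumptions (one rank per node, plain doubles instead of BigFloat, node k owning a contiguous block of Q, and the restoration from residues already done), a minimal sketch of this loop could look as follows; in the real code the send buffer is restored from the residue window right before sending, and ownership follows the DistMatrix layout.

```cpp
// Sketch of the pairwise exchange between nodes (one rank per node assumed).
// Q_local holds this node's full contribution Q_n, split into num_nodes
// contiguous blocks; Q_owned accumulates the globally reduced block owned
// by this node. Doubles stand in for BigFloat.
#include <mpi.h>
#include <vector>

void reduce_between_nodes(const std::vector<double> &Q_local, // num_nodes * block_size
                          std::vector<double> &Q_owned,       // block_size
                          int block_size, MPI_Comm comm)
{
  int node, num_nodes;
  MPI_Comm_rank(comm, &node);
  MPI_Comm_size(comm, &num_nodes);
  // Start from our own contribution to the block we own.
  for(int i = 0; i < block_size; ++i)
    Q_owned[i] = Q_local[node * block_size + i];
  std::vector<double> recv_buf(block_size);
  for(int offset = 1; offset < num_nodes; ++offset)
    {
      const int send_to = (node + offset) % num_nodes;
      const int recv_from = (node - offset + num_nodes) % num_nodes;
      // Send the elements owned by node (n + offset), receive the
      // contribution to our own block from node (n - offset).
      MPI_Sendrecv(Q_local.data() + send_to * block_size, block_size,
                   MPI_DOUBLE, send_to, 0, recv_buf.data(), block_size,
                   MPI_DOUBLE, recv_from, 0, comm, MPI_STATUS_IGNORE);
      for(int i = 0; i < block_size; ++i)
        Q_owned[i] += recv_buf[i];
    }
}
```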

Since each node usually has multiple ranks, we have to distribute the work among them. We use the following scheme:
if rank r owns a given element Q[i,j], then it receives contributions from the processes on other nodes that have the same rank within their node (we rely on the fact that all nodes have the same number of ranks).
For example, for 3 nodes with 128 cores each, rank 0 communicates only with ranks 128 and 256, rank 1 only with ranks 129 and 257, and so on. As a result, MPI_Sendrecv is called on the communicators (0,128,256), (1,129,257), etc.
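
A small illustration of this pairing (the helper is hypothetical; it assumes COMM_WORLD ranks are numbered node by node and all nodes have ranks_per_node ranks):

```cpp
// Hypothetical helper illustrating the pairing scheme: a rank exchanges data
// only with the ranks that have the same intra-node index on the other nodes.
#include <vector>

std::vector<int> peer_ranks(int world_rank, int ranks_per_node, int num_nodes)
{
  const int my_node = world_rank / ranks_per_node;
  const int rank_in_node = world_rank % ranks_per_node;
  std::vector<int> peers;
  for(int node = 0; node < num_nodes; ++node)
    if(node != my_node)
      peers.push_back(node * ranks_per_node + rank_in_node);
  return peers;
}
// Example: peer_ranks(0, 128, 3) == {128, 256}; peer_ranks(1, 128, 3) == {129, 257}.
```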

Each rank has send and receive buffers of size ~ #(Q) / num_ranks, and performs (num_nodes - 1) send/recv operations.

…rts + cosmetic code changes

In the current SDPB code, output.DistComm() == COMM_WORLD always.
The old reduce-scatter implementation used a ring algorithm over all ranks.
This led to MPI communication within a node.

But now each rank on a node has access to output_residue_window and can restore each Q_ij on the node.
Thus, we don't need communication within a node; now we reduce Q only between different nodes.

TODO: remove the obsolete reduce_scatter() function.
…e bigint_syrk/Readme.md

The new algorithm is implemented in BigInt_Shared_Memory_Syrk_Context::restore_and_reduce().
@vasdommes added this to the 3.0.0 milestone Mar 16, 2024
@vasdommes self-assigned this Mar 16, 2024
@vasdommes (Collaborator, Author) commented Mar 16, 2024

Benchmarks on Expanse HPC for nmax=18 stress-tensors-3d:
Q is a 5485x5485 matrix, precision=1024.

The old reduce-scatter algorithm (first plot) took a significant amount of time (5-10 minutes per solver iteration). This time grows linearly with the number of nodes and makes it impossible to speed up SDPB by increasing the number of nodes.

With the new algorithm (second plot), reduce-scatter is really fast (~10 seconds per solver iteration!), and overall SDPB performance scales very well even for 10+ nodes.

Overall, SDPB is now ~2x faster on 7 nodes and ~3.5x faster on 12 nodes.

[Plots: reduce-scatter time per solver iteration vs. number of nodes, for the old (first plot) and new (second plot) algorithms]

@vasdommes (Collaborator, Author) commented:

Note also that the communication itself (MPI_Sendrecv) is fast; most of the time is spent on serializing/deserializing BigFloat numbers.

@vasdommes merged commit 1b170dd into master Mar 16, 2024
2 checks passed
@vasdommes deleted the reduce-scatter branch March 16, 2024 05:57
Linked issue: Optimize reduce-scatter for Q matrix (#211)