Fix #211 Optimize reduce-scatter for Q matrix #212
Merged
TL;DR In our benchmark (see below), reduce-scatter became ~50x faster. Thanks to that, SDPB became ~2x faster on big problems and scales much better for a large number of nodes.
The old SDPB version used a ring algorithm among all ranks to accumulate local contributions to the Q matrix and write them to the global DistMatrix. The ring algorithm consists of `num_ranks - 1` iterations, and on each iteration a rank sends `~ #(Q) / num_ranks` matrix elements to another rank. That made sense before we started using a shared memory window for calculating Q, see #142.
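For orientation, a generic ring-style reduce-scatter looks roughly like this (a minimal sketch with `double` chunks; the actual SDPB code operates on its own matrix types and buffer layout):

```cpp
// Ring reduce-scatter (minimal sketch, not the actual SDPB code).
// Each rank holds a local contribution split into num_ranks chunks;
// after num_ranks - 1 iterations every rank owns one fully reduced chunk.
#include <mpi.h>
#include <vector>

void ring_reduce_scatter(std::vector<double> &chunks, int chunk_size,
                         MPI_Comm comm)
{
  int rank, num_ranks;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &num_ranks);

  std::vector<double> recv_buf(chunk_size);
  const int dest = (rank + 1) % num_ranks;
  const int source = (rank - 1 + num_ranks) % num_ranks;
  for(int step = 1; step < num_ranks; ++step)
    {
      // Pass partially reduced chunks around the ring and accumulate.
      const int send_chunk = (rank - step + 1 + num_ranks) % num_ranks;
      const int recv_chunk = (rank - step + num_ranks) % num_ranks;
      MPI_Sendrecv(chunks.data() + send_chunk * chunk_size, chunk_size,
                   MPI_DOUBLE, dest, 0, recv_buf.data(), chunk_size,
                   MPI_DOUBLE, source, 0, comm, MPI_STATUS_IGNORE);
      for(int i = 0; i < chunk_size; ++i)
        chunks[recv_chunk * chunk_size + i] += recv_buf[i];
    }
}
```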
Now all ranks on a node have access to the shared memory window containing residues of Q_n (the contribution to Q from node n) modulo all primes. Therefore, no communication inside a node is required, and all we have to do is reduce Q_n from all nodes into the global Q.
If each node had a single rank, the implemented algorithm would work as follows: if a node owns some element `Q[i,j]`, then all other nodes send their `Q_n[i,j]` to that node. In addition, the node restores its own contribution from the residues.
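In other words, the owner of each element simply accumulates the per-node contributions:

$$Q[i,j] = \sum_{n=1}^{\mathrm{num\_nodes}} Q_n[i,j]$$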
Since each node usually has multiple ranks, we have to distribute the work among them. We use the following scheme:
If rank r owns a given element Q[i,j], then it receives contributions from the processes on other nodes that have the same rank within their node (we use the fact that all nodes have the same number of ranks). For example, for 3 nodes with 128 cores each, rank 0 communicates only with ranks 128 and 256, rank 1 with ranks 129 and 257, and so on. As a result, `MPI_Sendrecv` is called on the communicators `(0,128,256)`, `(1,129,257)`, etc. Each rank has send and receive buffers of size `~ #(Q) / num_ranks` and performs `(num_nodes - 1)` send/recv operations.
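A rough sketch of this communication pattern, assuming for simplicity that each rank's contributions are already packed into per-destination-node buffers of plain `double`s (the real implementation works with the residues stored in the shared memory window, and the names below are illustrative, not the actual SDPB source):

```cpp
// Sketch of the cross-node reduction of Q (illustration only).
// Assumptions: ranks are laid out node by node, every rank owns the same
// number of elements of Q, and contributions[m] already holds this node's
// contribution to the elements owned by the peer rank on node m.
#include <mpi.h>
#include <cstddef>
#include <vector>

void reduce_Q_across_nodes(std::vector<std::vector<double>> &contributions,
                           int ranks_per_node, MPI_Comm world)
{
  int rank, num_ranks;
  MPI_Comm_rank(world, &rank);
  MPI_Comm_size(world, &num_ranks);
  const int num_nodes = num_ranks / ranks_per_node;
  const int rank_in_node = rank % ranks_per_node;

  // Group ranks that have the same index within their node,
  // e.g. (0,128,256), (1,129,257), ... for 3 nodes with 128 ranks each.
  MPI_Comm comm;
  MPI_Comm_split(world, rank_in_node, rank, &comm);
  int node;
  MPI_Comm_rank(comm, &node); // index of our node within this communicator

  auto &owned = contributions[node]; // starts as our own contribution Q_n
  std::vector<double> recv_buf(owned.size());

  // (num_nodes - 1) pairwise exchanges: send our contribution to each owner,
  // receive the other nodes' contributions to the elements we own, accumulate.
  for(int offset = 1; offset < num_nodes; ++offset)
    {
      const int dest = (node + offset) % num_nodes;
      const int source = (node - offset + num_nodes) % num_nodes;
      auto &send_buf = contributions[dest];
      MPI_Sendrecv(send_buf.data(), static_cast<int>(send_buf.size()),
                   MPI_DOUBLE, dest, 0, recv_buf.data(),
                   static_cast<int>(recv_buf.size()), MPI_DOUBLE, source, 0,
                   comm, MPI_STATUS_IGNORE);
      for(std::size_t i = 0; i < recv_buf.size(); ++i)
        owned[i] += recv_buf[i];
    }
  MPI_Comm_free(&comm);
}
```

With this pairing, each communicator contains exactly `num_nodes` ranks, so each rank sends `(num_nodes - 1)` messages of size `~ #(Q) / num_ranks` instead of performing `(num_ranks - 1)` ring iterations.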