Skip to content

Data Structures

Martin Thompson edited this page Feb 12, 2014 · 16 revisions

To support the transport of messages from senders to receivers a number of data structures are required. These data structures need to conform to a number of design principles for the common transport path, exceptional cases can be afforded an exceptional path, e.g. large messages:

  1. Copy-Free: The send and receive buffer is directly used.
  2. Allocation-Free: The send and receive buffer is pre-allocated so that the transport path is free from the allocation or reclamation of memory.
  3. Lock-Free: Lock-free techniques are employed to manage concurrent access.
  4. Wait-Free: Concurrent algorithms will complete within a finite number of steps.
  5. Persistent/Immutable: Message buffers for send and receive record immutable history to allow replay and resend semantics.
  6. O(1) Cost: All operations are constant time regardless of buffer size, senders, receivers, contention, etc.

Sender Archive

The sender needs to be able to support the efficient retransmission of messages. Messages may need to be retransmitted to cope with loss or late joiners to a communications stream. Each archive represents a term in the communication history from the source.

 Sender Archive - Memory Mapped File
+--------------------------------------------+---------------------+---------------+
|   Message buffer (64-byte aligned units)   | Index (msg offsets) | State Trailer |
+--------------------------------------------+---------------------+---------------+

Messages are stored to the buffer in preparation for transmission as a sequential stream with FIFO semantics. The next n bytes of the buffer are claimed by performing an atomic increment of the tail counter stored in the state trailer. The algorithm is designed to take advantage of LOCK XADD on x86 to avoid spinning CAS (LOCK CMPXCHG) loops.

The producers of message call send(bytes) on a source object passing the bytes to be sent. The send call then follows the following algorithm:

  1. IF message is bigger than transmission frame size THEN
    1. Send in chunks
  2. END IF
  3. WHILE next capacity claim < message size
    1. Mark claimed capacity with tombstone
    2. Move on to next term
  4. END WHILE
  5. Copy in message
  6. Mark record in archive as complete

A sender thread observes the sender archive and repeatedly performs the following algorithm:

  1. Assign message sequence
  2. Update index for message sequence offset
  3. Send message on underlying network layer

Message Index

Each archive contains an index of messages that can be used for the fast lookup of messages from sequence numbers.

State Trailer

The state trailer contains the variables describing the current state of the archive for the communications term. Variables are stored with consideration for false sharing issues when concurrent access is required.

  • tail: buffer offset at which the next message can be added. An atomic get-and-increment operation is used to advance the tail for claiming capacity in the buffer.
  • head: the offset in the buffer up to which messages have be transported.
  • term: current term of the archive
  • source: identifier for the source of messages

Receiver Archive

The receiver archive is a mirror of the sender archive with reciprocal semantics for assembling a sequence of messages from a sender over an unreliable network layer.

A receiver thread receives messages from the network layer into the receiver archive.