merkle_tree_on_disk
[Last updated October 2006]
The primary goals of moving DHash's Merkle tree synchronization system to a disk-based approach are to reduce the memory pressure caused by holding multiple Merkle trees in memory in both `lsd` and `syncd`, and to reduce the amount of disk I/O `syncd` must do to synchronize with each of its neighbors. In the current system, `syncd` chooses one successor per synchronization period, reads all the keys that it knows about for that successor off of disk, and constructs an in-memory Merkle tree. In the new system, `syncd` would need to read only the root block of a successor-specific tree from disk; then, if it differs from the remote root, it need only read certain blocks from disk to continue with the synchronization, rather than constructing the complete tree in memory.
The changes will affect all the major components of Chord/DHash: `syncd`, `lsd`, and `adbd`. First, we'll look at the new data structure for an on-disk Merkle tree, and then we'll talk about the changes needed for each component.
For each logically separate group of keys that DHash must keep track of (the keys for the blocks stored on the local node, and the set of keys for blocks stored on each successor node), DHash will keep a separate set of three files:
- Leaf nodes (`<db_dir>/<node_id>.<ctype>/leaf.mrk`): This file consists of blocks of up to 64 20-byte keys, plus an integer count of the number of keys in the block. The same amount of space is allocated for each block (64\*20+4 = 1284 bytes), so a block offset is enough to skip directly to a particular record in the file.
- Internal nodes (`<db_dir>/<node_id>.<ctype>/internal.mrk`): This file consists of blocks of exactly 64 20-byte hashes, a pointer to each of the 64 children, and a combined key count for all 64 children. Each pointer is a block index into either the internal file or the leaf file, and consists of a 31-bit block offset (enough for up to 137 billion keys per tree) and a 1-bit flag saying whether the block is internal or leaf. Each internal block is then 64\*(20+4)+4 = 1540 bytes.
- Index file (`<db_dir>/<node_id>.<ctype>/index.mrk`): This file contains a pointer to the root block of the Merkle tree, and a freelist containing pointers to free blocks within both the leaf and internal files. The pointers have the same format as the child pointers above.
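The block layouts above can be sketched as C++ structs; the struct and field names here are illustrative, not taken from the DHash source, but the sizes match the arithmetic above.

```cpp
#include <stdint.h>

// A packed pointer into the tree files: a 31-bit block offset plus a
// 1-bit flag distinguishing leaf blocks from internal blocks.
struct merkle_ptr {
  uint32_t raw;
  bool is_leaf () const { return raw & 1; }
  uint32_t offset () const { return raw >> 1; }
};

// Leaf block: up to 64 20-byte keys plus a key count.
// sizeof == 64*20 + 4 = 1284 bytes.
struct leaf_block {
  uint8_t keys[64][20];
  uint32_t nkeys;
};

// Internal block: 64 child hashes, 64 child pointers, and the
// combined key count of all 64 children.
// sizeof == 64*(20+4) + 4 = 1540 bytes.
struct internal_block {
  uint8_t hashes[64][20];
  merkle_ptr children[64];
  uint32_t nkeys;
};
```

Because every block in a file has the same size, a block's position in the file is just its offset times the block size, with no per-record framing needed.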
The on-disk Merkle tree should be able to share a lot of code with the existing Merkle tree implementation. In fact, we should be able to create a subclass of the `merkle_tree` class, let's call it `merkle_tree_disk`, that overrides only a few functions:
- `insert (u_int depth, merkle_hash &key, merkle_node *n)`
- `remove (u_int depth, merkle_hash &key, merkle_node *n)`
- `lookup (u_int *depth, u_int max_depth, const merkle_hash &key, merkle_node *n)`
- The constructor (take in file names, construct the root block, etc.)
This will likely also require subclassing the `merkle_node` class so that the `child()` method constructs the child by looking it up on disk, and stores the child pointers properly. Note that this means the `root` member of a `merkle_tree` must be a pointer, so that it can be instantiated as a subclassed type.
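A subclassed `child()` would turn a packed child pointer into a byte offset in the appropriate `.mrk` file and read the block from there. A minimal sketch of that offset arithmetic, assuming the block sizes above (the function name is hypothetical):

```cpp
#include <stdint.h>
#include <sys/types.h>

enum { LEAF_BLOCK_SIZE = 1284, INTERNAL_BLOCK_SIZE = 1540 };

// Byte offset of the block named by a packed pointer: low bit selects
// the file (and therefore the block size), the remaining 31 bits are
// the block index within that file.
static off_t block_offset (uint32_t ptr)
{
  uint32_t index = ptr >> 1;
  uint32_t blocksize = (ptr & 1) ? LEAF_BLOCK_SIZE : INTERNAL_BLOCK_SIZE;
  return (off_t) index * blocksize;
}
```

A `child()` implementation would then `pread()` one block at this offset from `leaf.mrk` or `internal.mrk` and wrap it in a `merkle_node`.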
The `insert`, `remove`, and `lookup` functions will now require synchronous disk I/O in `merkle_tree_disk`, so they should never be used directly by the `lsd` process, which needs to remain unblocked to handle network requests quickly. Thus, this implementation requires moving the handling of the local-keys Merkle tree into the `syncd` process, and having the `syncd` processes communicate among themselves. More on this in the next section.
Note that the atomicity of inserts will be important if more than one process has to read the Merkle tree. Atomicity could be achieved by copying all modified blocks into free slots, switching the root pointer in the index file only once all the new blocks have been written, and finally adding all the old blocks to the free list. In the current design, `adbd` must update the Merkle tree of local keys on a block insert, while `syncd` reads it, so atomic inserts are pretty important.
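The copy-on-write scheme can be illustrated with a toy in-memory model; the `store` type and its helpers below are stand-ins for the `.mrk` files and freelist, not the DHash code.

```cpp
#include <stdint.h>
#include <vector>

// Toy model of copy-on-write updates: blocks live in a flat array
// standing in for a .mrk file, and an update never modifies a live
// block in place.
struct store {
  std::vector<int> blocks;        // block contents (toy payloads)
  std::vector<uint32_t> freelist; // indices of free slots
  uint32_t root;                  // index of the current root block

  uint32_t alloc () {
    if (!freelist.empty ()) {
      uint32_t i = freelist.back ();
      freelist.pop_back ();
      return i;
    }
    blocks.push_back (0);
    return blocks.size () - 1;
  }

  void update_root (int newval) {
    uint32_t n = alloc ();
    blocks[n] = newval;           // step 1: write new block(s) to free slots
    uint32_t old = root;
    root = n;                     // step 2: switch the root pointer last
    freelist.push_back (old);     // step 3: recycle the old block(s)
  }
};
```

A reader that grabbed the old root index before the switch still sees a consistent (old) tree, because none of its blocks were overwritten; only after the root pointer flips do the old blocks become eligible for reuse.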
`syncd` will require the most work of any of the Chord/DHash components. Basically, it will have to serve as both the server and the client in the synchronization protocol, similar to the way `sampled` works. (In the current system, `lsd` serves as the server during synchronization.) `syncd` will consist of the following components:
- For each vnode:
    - For each content type (noauth, chash, and possibly keyauth):
        - A `merkle_server`, which listens for RPCs on a well-known port and answers queries using a `merkle_tree_disk` over the node's local set of keys.
        - A `syncer` that, once every synchronization period, picks a successor and makes a `merkle_tree_disk` for the keys known to be on that successor. It passes this tree to the `merkle_syncer` to make the RPCs to the remote `syncd` for synchronization.
The majority of this code already exists. Once `merkle_tree_disk` is implemented, it should just be a matter of moving things around from `lsd` to `syncd`.
When `syncd` learns about new keys on a successor, it will insert them into the `merkle_tree_disk`, which will save them to disk synchronously. It will also need to update the Berkeley DB files as usual, via `adbd`, to make sure it maintains the correct replica counts, so that `lsd` can decide which replicas to repair.
All `merkle_tree` and `merkle_server` references should be removed entirely from `lsd`. In the future, it's possible we'll need to change the repair algorithms, but they will still likely involve similar communication with `adbd`, so maybe that won't involve much change.
`adbd` must now update the Merkle tree over the node's local keys on every block insert and removal received from `lsd`. This will require it to know the locations of the Merkle files on disk, but we can just put them in the same directory as the Berkeley DB files, so that we won't have to pass any more paths around.
Also, the use of the on-disk Merkle trees should obviate the need for the `keyauxdb` file, though it still seems pretty useful for iterating through all keys for the local node, so maybe we'll keep it around (and updated by `adbd`) for now.
Eventually it may make sense to use an on-disk min-heap structure to inform `lsd` of the objects most in need of repair, but we don't think this is necessary for the first phase of the implementation.
We'll need a way to initially create the on-disk Merkle trees from the information currently stored in `adbd`. The easiest way to do this will probably be to write a standalone program in `devel/` or `utils/` that reads everything from `adbd` and inserts the data into `merkle_tree_disk` objects, making a new one every time it comes across a new successor.
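The shape of that conversion pass can be sketched with a toy model; the `(successor, key)` record stream and `toy_tree` stand in for whatever interface `adbd` actually exposes and for `merkle_tree_disk` itself, so all names here are illustrative.

```cpp
#include <map>
#include <string>
#include <vector>

// Toy stand-in for a merkle_tree_disk: just the keys inserted into it.
typedef std::vector<std::string> toy_tree;

// Walk (successor, key) records and create one tree per successor the
// first time that successor is seen, inserting each key into the
// successor's tree.
std::map<std::string, toy_tree>
build_trees (const std::vector<std::pair<std::string, std::string> > &records)
{
  std::map<std::string, toy_tree> trees;
  for (size_t i = 0; i < records.size (); i++) {
    // operator[] default-constructs an empty tree on first sight of a
    // successor, standing in for creating a new merkle_tree_disk.
    trees[records[i].first].push_back (records[i].second);
  }
  return trees;
}
```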