feat: deterministic DAG structure #36

alanshaw · 2024-03-11T15:39:16Z

This PR changes the sharding strategy to ensure a pail DAG is deterministic in structure (provided operations do not touch the same keys) i.e. operations can be applied in any order, and they will result in the same DAG.

This makes diffing easier, and can enable pail to operate efficiently in a diff syncing environment without a merkle clock.

It also results in a significant speedup to put operations, due to the reduced string comparison and manipulation operations needed.

$ node bench/put-x10_000.js   
setup
bench
..........
put x10000: 51.736s

$ node bench/put-x10_000-v1.js
setup
bench
..........
put x10000: 1.136s

While the strategy (and configuration) for shards has changed, the main format has not. This means that v1 can read v0 but should not write to it.

The change in the sharding strategy is that instead of sharding at the back of a key (largest common prefix) when the size limit is reached, we shard at the front, using the smallest common prefix (which is always a single character) always, if a common prefix exists. This effectively ensures consistent sharding.

For example, previously when putting aaaa and then aabb we might end up with a DAG like this when the shard size is reached:

aa -> aa
      bb

If we then put abbb we might then get a DAG like:

aa -> aa
      bb
abbb

...but if abbb was put BEFORE aabb we may have ended up with a DAG like:

a -> aaa
     abb
     bbb

Now we always end up with a DAG like the following, because we always shard when there's a common prefix, independent of the order and indepentent of shard size:

a -> a -> aa
          bb
     bbb

That is to say, there is no maximum shard size, and the max key length is now absolute, not just the point at which a shard is created.

The maximum shard size is controlled by the creator of the pail by specifying 2 options:

keyChars - the characters allowed in keys (default ASCII only).
maxKeySize - maximum size in bytes a UTF-8 encoded key is allowed to be (default 4096 bytes).

A good estimate for the max size of a shard is the number of characters allowed in keys multiplied by the maximum key size. For ASCII keys of 4096 bytes the biggest possible shard will stay well within the maximum block size allowed by libp2p.

Note: The default max key size is the same as MAX_PATH - the maximum filename+path size on most windows/unix systems so should be sifficient for most purposes.

Note: even if you use unicode key characters it would still be tricky to exceed the max libp2p block size, but it is not impossible.

Overall this makes pail skewed less towards quick reads and more towards quick writes. This is contrary to the original goal of the project but on balance I think worth the trade offs, in view of the DAG structure properties and the write speed increase.

I discovered a bug with removals in the current implementation - In theory there's a chance if you put the same value to multiple keys and then update it, a shard might appear in the removals list that is a shard in a different part of the tree. The new version fixes this by encoding the path prefix into each shard.

The following trade offs exist:

Total DAG size is around 20% bigger.
Theres around around 8x more blocks in the DAG.
Average DAG depth (the number of blocks you need to traverse to extract a value) is now ~4 vs ~2 previously. This will have a direct impact in read speed, especially network bound reads. Mitigated somewhat by the smaller block size.

However:

Put operations are now ~50x faster.
Average block size is 85% smaller.
The biggest shard in the DAG is now 99% smaller.
DAG structure is deterministic, which should speed up diffing

jchris

How will the v1 API choice show up to downstream users? It can read the old version but the new version can't read it? That makes for an easy upgrade path on my end.

burdiyan · 2024-05-05T21:02:49Z

With these changes Pail now seems a bit more like a Prolly Tree, at least in terms of its properties: determinism, sorted iterations, etc.

What use cases would be better covered by Pail compared to Prolly Tree?

jchris approved these changes Mar 11, 2024

View reviewed changes

Alan Shaw added 4 commits March 27, 2024 10:33

feat: deterministic DAG structure

73b8e3b

test: add tests

5eefdff

fix: tweaks

bd0f149

feat: add lt and lte entries option

d9ffcfa

alanshaw force-pushed the feat/deterministic-structure branch from 373e485 to d9ffcfa Compare March 27, 2024 10:40

Alan Shaw added 4 commits March 27, 2024 15:39

fix: key range

8e697b5

fix: undefined or null options

df640b1

fix: entries range

cae1621

feat: implement batching and cleanup

ea2415b

alanshaw merged commit a26acea into main Apr 22, 2024
2 checks passed

alanshaw deleted the feat/deterministic-structure branch April 22, 2024 12:48

github-actions bot mentioned this pull request Apr 22, 2024

chore(main): release 0.6.0 #39

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: deterministic DAG structure #36

feat: deterministic DAG structure #36

alanshaw commented Mar 11, 2024 •

edited

Loading

jchris left a comment

burdiyan commented May 5, 2024

feat: deterministic DAG structure #36

feat: deterministic DAG structure #36

Conversation

alanshaw commented Mar 11, 2024 • edited Loading

jchris left a comment

Choose a reason for hiding this comment

burdiyan commented May 5, 2024

alanshaw commented Mar 11, 2024 •

edited

Loading