Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

feat: deterministic DAG structure #36

Merged
merged 8 commits into from
Apr 22, 2024
Merged

Conversation

alanshaw
Copy link
Member

@alanshaw alanshaw commented Mar 11, 2024

This PR changes the sharding strategy to ensure a pail DAG is deterministic in structure (provided operations do not touch the same keys) i.e. operations can be applied in any order, and they will result in the same DAG.

This makes diffing easier, and can enable pail to operate efficiently in a diff syncing environment without a merkle clock.

It also results in a significant speedup to put operations, due to the reduced string comparison and manipulation operations needed.

$ node bench/put-x10_000.js   
setup
bench
..........
put x10000: 51.736s

$ node bench/put-x10_000-v1.js
setup
bench
..........
put x10000: 1.136s

While the strategy (and configuration) for shards has changed, the main format has not. This means that v1 can read v0 but should not write to it.

The change in the sharding strategy is that instead of sharding at the back of a key (largest common prefix) when the size limit is reached, we shard at the front, using the smallest common prefix (which is always a single character) always, if a common prefix exists. This effectively ensures consistent sharding.

For example, previously when putting aaaa and then aabb we might end up with a DAG like this when the shard size is reached:

aa -> aa
      bb

If we then put abbb we might then get a DAG like:

aa -> aa
      bb
abbb

...but if abbb was put BEFORE aabb we may have ended up with a DAG like:

a -> aaa
     abb
     bbb

Now we always end up with a DAG like the following, because we always shard when there's a common prefix, independent of the order and indepentent of shard size:

a -> a -> aa
          bb
     bbb

That is to say, there is no maximum shard size, and the max key length is now absolute, not just the point at which a shard is created.

The maximum shard size is controlled by the creator of the pail by specifying 2 options:

  • keyChars - the characters allowed in keys (default ASCII only).
  • maxKeySize - maximum size in bytes a UTF-8 encoded key is allowed to be (default 4096 bytes).

A good estimate for the max size of a shard is the number of characters allowed in keys multiplied by the maximum key size. For ASCII keys of 4096 bytes the biggest possible shard will stay well within the maximum block size allowed by libp2p.

Note: The default max key size is the same as MAX_PATH - the maximum filename+path size on most windows/unix systems so should be sifficient for most purposes.

Note: even if you use unicode key characters it would still be tricky to exceed the max libp2p block size, but it is not impossible.

Overall this makes pail skewed less towards quick reads and more towards quick writes. This is contrary to the original goal of the project but on balance I think worth the trade offs, in view of the DAG structure properties and the write speed increase.

I discovered a bug with removals in the current implementation - In theory there's a chance if you put the same value to multiple keys and then update it, a shard might appear in the removals list that is a shard in a different part of the tree. The new version fixes this by encoding the path prefix into each shard.

The following trade offs exist:

  • Total DAG size is around 20% bigger.
  • Theres around around 8x more blocks in the DAG.
  • Average DAG depth (the number of blocks you need to traverse to extract a value) is now ~4 vs ~2 previously. This will have a direct impact in read speed, especially network bound reads. Mitigated somewhat by the smaller block size.

However:

  • Put operations are now ~50x faster.
  • Average block size is 85% smaller.
  • The biggest shard in the DAG is now 99% smaller.
  • DAG structure is deterministic, which should speed up diffing

Copy link
Contributor

@jchris jchris left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How will the v1 API choice show up to downstream users? It can read the old version but the new version can't read it? That makes for an easy upgrade path on my end.

@alanshaw alanshaw force-pushed the feat/deterministic-structure branch from 373e485 to d9ffcfa Compare March 27, 2024 10:40
@alanshaw alanshaw merged commit a26acea into main Apr 22, 2024
2 checks passed
@alanshaw alanshaw deleted the feat/deterministic-structure branch April 22, 2024 12:48
@burdiyan
Copy link

burdiyan commented May 5, 2024

With these changes Pail now seems a bit more like a Prolly Tree, at least in terms of its properties: determinism, sorted iterations, etc.

What use cases would be better covered by Pail compared to Prolly Tree?

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants