Sync from newest to oldest #620

Open
jbearer opened this issue Jun 4, 2024 · 1 comment · May be fixed by #726

jbearer commented Jun 4, 2024

It would be better to run proactive scans, especially major scans, backwards, because the newest data is both the most likely to be missing (in the case where a node has just been offline for a short time, like an update) and the most likely to be queried.

Adding a reverse block stream might help simplify some of the transaction streaming stuff for the explorer as well.


jbearer commented Jun 5, 2024

I'm thinking of an idea that might be better. Instead of having major and minor scans that run repeatedly, we have just two scans total, both running perpetually, and we dovetail them in chunks. I will call these scans "heads" and "tails".

The heads scan always follows the chain head. Its job is to fetch recent blocks that we may have missed on decide, such as if we briefly lost network connection and missed a decide. The tails scan starts at the tail of the chain and follows behind the heads scan. Its job is to fetch old data that we're missing, such as after a long restart.

Both of these scans follow the exact same pattern, except for where they start and what they do when they reach the current block height (a rough sketch of the dovetailing loop follows the list):

  • The tails scan starts at block 0 (or the pruned height)
  • The heads scan starts at the last saved block height
  • When a scan runs, it starts from where it previously left off, and fetches blocks until it reaches a chunk limit, or the tip of the chain. Then it sleeps and yields to the other scan.
  • When the heads scan hits the tip of the chain, it just yields. This way it will usually be following very close to the chain tip, but if the chain tip ever jumps forward a lot (such as after a network outage or restart), the heads scan will go over all the blocks that were missed.
  • When the tails scan catches up to the heads scan, it starts over from 0 (or the pruned height). It will always catch up, because it can scan as fast as it is able, while the heads scan is bounded by the time it takes for new blocks to be decided.
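A minimal sketch of the dovetailing loop, under the assumption that each scan keeps a cursor and runs in bounded chunks. Everything here (`Scanner`, `fetch_missing_block`, the chunk size) is a hypothetical placeholder, not the query service's actual API:

```rust
// Hypothetical sketch only: `fetch_missing_block` stands in for the real
// "fetch this block if we don't already have it" logic.
const CHUNK_SIZE: u64 = 500;

struct Scanner {
    heads_cursor: u64,  // next height the heads scan will look at
    tails_cursor: u64,  // next height the tails scan will look at
    pruned_height: u64, // where the tails scan restarts from
}

impl Scanner {
    /// One chunk of the heads scan: fetch from the cursor up to the chunk
    /// limit or the chain tip, whichever comes first, then yield.
    fn run_heads_chunk(&mut self, tip: u64) {
        let end = (self.heads_cursor + CHUNK_SIZE).min(tip);
        for height in self.heads_cursor..end {
            fetch_missing_block(height);
        }
        self.heads_cursor = end;
        // If we hit the tip we simply yield. Normally we stay right behind
        // the tip; after an outage, successive chunks sweep over the gap.
    }

    /// One chunk of the tails scan: fetch a bounded chunk, and restart from
    /// the pruned height once we catch up to the heads scan.
    fn run_tails_chunk(&mut self) {
        let end = (self.tails_cursor + CHUNK_SIZE).min(self.heads_cursor);
        for height in self.tails_cursor..end {
            fetch_missing_block(height);
        }
        self.tails_cursor = end;
        if self.tails_cursor >= self.heads_cursor {
            // Caught up: start over from the beginning (or the pruned height).
            self.tails_cursor = self.pruned_height;
        }
    }
}

fn fetch_missing_block(_height: u64) {
    // Placeholder: check local storage and fetch from a peer if missing.
}

fn main() {
    let mut scanner = Scanner { heads_cursor: 90, tails_cursor: 0, pruned_height: 0 };
    for tip in [100, 120, 150, 400] {
        // Dovetail: alternate bounded chunks of the two scans.
        scanner.run_heads_chunk(tip);
        scanner.run_tails_chunk();
    }
}
```

The point of the sketch is that each call does at most one chunk of work, so neither scan can starve the other.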

The big advantage of this scheme is that a bounded amount of work is done each time a scan runs. We don't get occasional long pauses where a major scan runs, CPU usage spikes, and we stop fetching more recent missing blocks (because minor scans have to wait for the major scan to finish).

Another big advantage is that on startup, we don't have to wait for a major scan to read the database all the way from 0 before we start fetching more recent blocks, which are more likely to be missing.

jbearer self-assigned this Oct 31, 2024
jbearer added a commit that referenced this issue Oct 31, 2024
This changes leaf fetching to require already having the _next_
leaf. This tells us what the hash should be for the leaf being fetched.

This addresses two unrelated issues:
1. Leaf fetching is trusted: we were not previously verifying that
   the leaf returned by a peer is valid in any way. Now, we can verify
   the fetched leaf against the expected hash (and similarly for the
   fetched QC). We exploit the chaining property of HotShot leaves to
   avoid having to run any kind of consensus light client to verify
   fetched leaves (a toy sketch of this chained check follows this
   commit message).
2. Fetching leaves by hash instead of (or in addition to) by height
   allows us to implement more providers. For example, we can now
   implement a provider that pulls leaves from undecided consensus
   storage, by hash, which allows us to fetch a leaf from our own
   storage even if we missed the corresponding decide event.

Knock-on changes:
* Proactive scanning now runs backwards, since leaf fetching becomes
  kind of inherently backwards. This was a change we wanted anyways
  (closes #620)
* To facilitate that, we have added reverse streams for all the
  resource types, which may come in handy for the explorer
* Receiving a leaf (either by fetching or decide event) now triggers
  us to fetch the parent if it is missing. This supplements the
  proactive fetcher and can often help us obtain missing data really
  quickly (e.g. if we just missed one decide)
* `NoStorage` is gone. The purpose of `NoStorage` was to stress test
  fetching, which it did in the beginning, but it has been ages since
  we found a real bug this way; we now have plenty of other adversarial
  fetching tests, and `NoStorage` is becoming more trouble than it's
  worth to maintain. In particular, effective fetching now depends on
  having somewhat reliable storage (a reasonable assumption!) because
  we assume if you fetch a leaf, you can look it up shortly thereafter
  and thus fetch its parent. With this change, the `NoStorage` tests
  were failing or very slow because of many failed requests.
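To make the chained verification in point 1 concrete, here is a toy sketch of the idea: a fetched parent leaf is accepted only if its commitment matches the parent commitment recorded in the child leaf we already hold. The `Leaf` struct and hash below are placeholders, not the real HotShot types or commitment scheme.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy stand-in for a HotShot leaf: each leaf records a commitment to its parent.
#[derive(Hash)]
struct Leaf {
    height: u64,
    payload: Vec<u8>,
    parent_commitment: u64,
}

impl Leaf {
    /// Toy commitment; a real implementation uses a cryptographic commitment.
    fn commitment(&self) -> u64 {
        let mut hasher = DefaultHasher::new();
        self.hash(&mut hasher);
        hasher.finish()
    }
}

/// Accept a fetched parent only if its commitment matches the one recorded in
/// the child leaf we already hold. This chaining property is what lets us
/// verify fetched leaves without running any kind of consensus light client.
fn verify_fetched_parent(child: &Leaf, fetched_parent: &Leaf) -> bool {
    fetched_parent.commitment() == child.parent_commitment
}

fn main() {
    let parent = Leaf { height: 41, payload: b"block 41".to_vec(), parent_commitment: 0 };
    let child = Leaf { height: 42, payload: b"block 42".to_vec(), parent_commitment: parent.commitment() };
    assert!(verify_fetched_parent(&child, &parent));
}
```

As the commit message notes, the same idea also applies to the fetched QC.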
jbearer linked a pull request Oct 31, 2024 that will close this issue
jbearer added a commit that referenced this issue Oct 31, 2024
jbearer added a commit that referenced this issue Nov 12, 2024
jbearer added a commit that referenced this issue Nov 12, 2024
jbearer added a commit that referenced this issue Nov 14, 2024