
[ETCM-103] Restartable state sync #730


Merged
merged 10 commits into develop from etcm-103/restartable-state-sync on Oct 16, 2020

Conversation

KonradStaniec

@KonradStaniec KonradStaniec commented Oct 9, 2020

Description

Makes it possible to restart state sync if the pivot block goes stale during it.

Proposed Solution

The way it works is:

  1. After switching to state sync, we leave the main fast sync loop running, but the only thing it does there is check how many peers have a possible pivot block higher than our current one by some margin.
  2. When the number of peers with this better possible pivot block is larger than the minimum number of peers needed to pick a pivot block, we: stop state sync, pick a new pivot, sync blockchain data up to the new pivot, and restart state sync from the new pivot.
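The staleness check described in the steps above can be sketched roughly as follows. This is a minimal, self-contained sketch: `PeerInfo` and the parameter names are stand-ins modelled on the `(info.maxBlockNumber - syncConfig.pivotBlockOffset) - state.pivotBlock.number >= syncConfig.maxPivotBlockAge` condition quoted later in this review, not the PR's actual types.

```scala
// Hypothetical, simplified stand-in for the peer info the fast sync loop inspects.
final case class PeerInfo(maxBlockNumber: BigInt)

// A peer's possible pivot is its best block minus the pivot offset; the current pivot
// is considered stale once enough peers offer a pivot ahead of ours by the age margin.
def pivotIsStale(
    peers: Seq[PeerInfo],
    currentPivot: BigInt,
    pivotBlockOffset: BigInt,
    maxPivotBlockAge: BigInt,
    minPeersToChoosePivot: Int
): Boolean = {
  val betterPeers =
    peers.count(p => (p.maxBlockNumber - pivotBlockOffset) - currentPivot >= maxPivotBlockAge)
  betterPeers >= minPeersToChoosePivot
}
```

When `pivotIsStale` returns true, the loop would stop state sync, pick a new pivot, and resume from it as described in step 2.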

Possible improvements

The best possible improvement would be concurrent blockchain and state download; then this whole dance with updating the pivot would be unnecessary. The only thing state sync would need to do is track the best synced block and restart when the best synced block is higher than the current state sync pivot by some margin. This would require a small overhaul of the way we handle available peers in the upper layers of Mantis, to avoid simultaneous concurrent requests for state and blockchain data.

Bonus

The SyncControllerSpec tests have been refactored to use an autopilot.

Testing

I was able to sync to mainnet 4 times now with this setup. (For now without node restarts during state sync, as proper restarting requires one more ticket, https://jira.iohk.io/browse/ETCM-213, i.e. refilling the bloom filter after a node restart. Without it, it should theoretically be possible, but it can be painfully slow due to the large number of false positives from the bloom filter which do not correspond to database content.)

@KonradStaniec KonradStaniec requested a review from mmrozek October 9, 2020 14:23
@KonradStaniec KonradStaniec requested a review from kapke October 14, 2020 08:55
@KonradStaniec KonradStaniec marked this pull request as ready for review October 14, 2020 08:55
# Current size of the ETC state trie is around 150M nodes, so 200M is set to have some reserve.
# If the number of elements inserted into the bloom filter were significantly higher than expected, the number
# of false positives would rise, which would degrade state sync performance.
state-sync-bloomFilter-size = 200000000
state-sync-bloom-filter-size? to be consistent


# Max number of MPT nodes held in memory during state sync before saving them into the database.
# 100k is around 60 MB (each key-value pair is around 600 bytes).
state-sync-persistBatch-size = 100000
state-sync-persist-batch-size

# If the new pivot block received from the network is lower than fast sync's current pivot block, the retry to choose
# a new pivot will be scheduled after this time. The average block time in ETC/ETH is around 15s, so after this time
# most network peers should have a new best block.
pivot-block-reSchedule-interval = 15.seconds
pivot-block-reschedule-interval

scheduler.scheduleOnce(syncConfig.pivotBlockReScheduleInterval, self, UpdatePivotBlock(updateReason))
}

def waitingForPivotBlockUpdate(updateReason: PivotBlockUpdateReason): Receive = handleCommonMessages orElse {
case PivotBlockSelector.Result(pivotBlockHeader) =>
log.info(s"New pivot block with number ${pivotBlockHeader.number} received")
if (pivotBlockHeader.number >= syncState.pivotBlock.number) {
It will be more readable if you use pattern matching instead of nested ifs
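A minimal, self-contained illustration of that suggestion. The names and the comparison direction here are placeholders rather than the PR's exact logic; the point is the flat guard-based match replacing nested if/else.

```scala
sealed trait PivotDecision
case object UpdatePivot extends PivotDecision
case object Reschedule extends PivotDecision

// Guard-based match: the two outcomes read as flat alternatives instead of nested branches.
def decide(newPivot: BigInt, currentPivot: BigInt): PivotDecision =
  newPivot match {
    case n if n >= currentPivot => UpdatePivot // fresh-enough pivot: adopt it
    case _                      => Reschedule  // stale pivot: ask again later
  }
```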

reScheduleAskForNewPivot(updateReason)
} else {
updatePivotSyncState(updateReason, pivotBlockHeader)
syncState = syncState.copy(updatingPivotBlock = false)
I think syncState = syncState.copy(updatingPivotBlock = false) should be done in updatePivotSyncState method

val (nodes, newState) = state.getNodesToPersist
nodes.foreach { case (hash, (data, reqType)) =>
reqType match {
case _: CodeRequest =>
blockchain.storeEvmCode(hash, data).commit()
bloomFilter.put(hash)
Very minor: You could call bloomFilter.put(hash) before the pattern matching

// restart. This can be done by exposing a RocksDB iterator to traverse the whole MPT node storage.
// Another possibility is that there is some lightweight alternative in RocksDB to check key existence.
state.memBatch.contains(req.nodeHash) || isInDatabase(req)
if (state.memBatch.contains(req.nodeHash)) {
Simpler: state.memBatch.contains(req.nodeHash) || (bloomFilter.mightContain(req.nodeHash) && isInDatabase(req))
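A self-contained sketch of that short-circuit order. The `memBatch`, `mightContain`, and `isInDatabase` names stand in for the PR's actual members; since a bloom filter never gives false negatives, a negative answer from it safely skips the database lookup.

```scala
// Check the cheap in-memory batch first, then the bloom filter, and only
// consult the expensive database lookup when the filter says "maybe".
def alreadyKnown(
    hash: Seq[Byte],
    memBatch: Set[Seq[Byte]],
    mightContain: Seq[Byte] => Boolean, // bloom filter query: cheap, no false negatives
    isInDatabase: Seq[Byte] => Boolean  // expensive database lookup
): Boolean =
  memBatch.contains(hash) || (mightContain(hash) && isInDatabase(hash))
```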

}

case PersistSyncState => persistSyncState()

case UpdatePivotBlock(state) => updatePivotBlock(state)
}

private def updatePivotBlock(state: FinalBlockProcessingResult): Unit = {
private def updatePivotBlock(state: PivotBlockUpdateReason): Unit = {
minor: state or reason then?

KonradStaniec (author) replied:
reason - forgot to change

@@ -267,6 +275,17 @@ class FastSync(
)
syncState =
syncState.updatePivotBlock(pivotBlockHeader, syncConfig.fastSyncBlockValidationX, updateFailures = true)

case NodeRestart =>
Shouldn't it be named SyncRestart? If the fast-sync actor gets restarted due to some failure caught by the supervisor, it's going to be restarted with a clean state, the same way as if the whole node had been restarted.

That also makes me think - shouldn't SyncController watch for FastSync restarts and start it once such a restart happens?

KonradStaniec (author) replied:

So I agree that SyncRestart is a more compelling name.

The question about supervision is more nuanced. In general I am not sure we handle it well across the whole codebase, but for this particular case we are fine, as FastSync is a child of SyncController and the default strategy for an uncaught exception in a child is just to restart it. It will probably mean some of the requests in flight will later be ignored and some of the peers will get unnecessarily blacklisted. Those missed requests may trigger some weird error conditions. In my view this whole class was not designed with restarts in mind, but rather with handling all exceptions by itself.

(info.maxBlockNumber - syncConfig.pivotBlockOffset) - state.pivotBlock.number >= syncConfig.maxPivotBlockAge
}

private def getPeerWithTooFreshNewBlock(
is it really "too fresh", or rather "fresh enough to update to"?

@@ -784,8 +860,15 @@ object FastSync {

case object ImportedPivotBlock extends HeaderProcessingResult

sealed abstract class FinalBlockProcessingResult
sealed abstract class PivotBlockUpdateReason {
def nodeRestart: Boolean = this match {
Minor: isNodeRestart?
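A self-contained sketch of the sealed hierarchy with the suggested `isNodeRestart` name. Only the restart case and the match shape come from the diff; `ImportedLastBlock` is an illustrative stand-in for the other update reasons.

```scala
sealed trait PivotBlockUpdateReason {
  // Predicate-style name per the review suggestion (isNodeRestart, not nodeRestart).
  def isNodeRestart: Boolean = this match {
    case NodeRestart => true
    case _           => false
  }
}
case object ImportedLastBlock extends PivotBlockUpdateReason // hypothetical non-restart reason
case object NodeRestart extends PivotBlockUpdateReason
```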


@kapke kapke left a comment


The code looks good!
I'd like to see how it works though. How much time should I expect to wait on mainnet before state sync starts?

@KonradStaniec

So currently, average sync times look like:

  • 8-10h for the blockchain
  • 6-10h for state sync

And state sync starts only after the blockchain is downloaded. There is also still one issue with blockchain sync which makes it get stuck; a restart with some config tweaks is needed to resume it. If it happens to you, let me know. (We already have a ticket to track it, and I suspect what the issue is.)

State sync has higher variability in sync times, as it is more parallel and depends on the number of peers, which for now is essentially random due to the random-walk nature of our current discovery. On my machine I finished state sync in 6h when I got 9-10 peers, and in 10h when I got 3-4 peers.


@mmrozek mmrozek left a comment


LGTM!

@KonradStaniec KonradStaniec merged commit 6e3c185 into develop Oct 16, 2020
@KonradStaniec KonradStaniec deleted the etcm-103/restartable-state-sync branch October 16, 2020 12:25