Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

fix(manager): gossip any pending blocks not gossiped before (in case of restart or crash) #1045

Merged
merged 17 commits into from
Aug 28, 2024

Conversation

srene
Copy link
Contributor

@srene srene commented Aug 27, 2024

PR Standards

Opening a pull request should be able to meet the following requirements

--

PR naming convention: https://hackmd.io/@nZpxHZ0CT7O5ngTp0TP9mg/HJP_jrm7A


In case the node crashes there maybe blocks that are not gossiped, but applied. In this case, blocks will never be gossiped (and neither added to blocksync store) and therefore not retrievable from P2P.
This pr fixes it by adding blocks to blocksync right after are produced and stored to db, this way blocks will be aways available in p2p, even if not gossiped.

Close #1044

<-- Briefly describe the content of this pull request -->

For Author:

  • Targeted PR against correct branch
  • included the correct type prefix in the PR title
  • Linked to Github issue with discussion and accepted design
  • Targets only one github issue
  • Wrote unit and integration tests
  • All CI checks have passed
  • Added relevant godoc comments

For Reviewer:

  • confirmed the correct type prefix in the PR title
  • Reviewers assigned
  • confirmed all author checklist items have been addressed

After reviewer approval:

  • In case targets main branch, PR should be squashed and merged.
  • In case PR targets a release branch, PR should be rebased.

@srene srene requested a review from a team as a code owner August 27, 2024 11:48
@srene srene marked this pull request as draft August 27, 2024 11:48
@srene srene self-assigned this Aug 27, 2024
@srene srene marked this pull request as ready for review August 27, 2024 12:43
block/produce.go Outdated
@@ -120,6 +120,12 @@ func (m *Manager) produceApplyGossip(ctx context.Context, allowEmpty bool, nextP
return nil, nil, fmt.Errorf("gossip block: %w", err)
}

m.State.LastGossipedHeight = block.Header.Height
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why updating it here and not in the gossipBlock function? that way seems much less error-prone to have one place (cause now for example in the node recovery function for gossiping were not updating it)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved inside function

block/manager.go Outdated
@@ -328,3 +330,22 @@ func (m *Manager) setDA(daconfig string, dalcKV store.KV, logger log.Logger) err
m.Retriever = retriever
return nil
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

another edge case I see here is the following: I was a sequencer previously, and than I wasn't (rotation. but continued being a full node), and than I am again. In this case, my store LastGossipedHeight wasn't updated when I was a full node (or at least I don't see it). and than I need to get stuck on gossiping potentially a big amount.

I think we need some sort of a sane skew after which we avoid gossiping (i.e if I'm 1M blocks away from the latest state, than probably it doesn't make sense).

Copy link
Contributor Author

@srene srene Aug 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i added an update for LastGossipedHeight inside UpdateTargetHeight, this will cause full-nodes also update LastGossipedHeight when retrieving form p2p or DA. this way, when rotating the value will be updated i think

block/state.go Outdated
@@ -32,6 +32,10 @@ func (m *Manager) LoadStateOnInit(store store.Store, genesis *tmtypes.GenesisDoc
}

m.State = s

if m.State.LastGossipedHeight == 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

worth adding a comment here explainig the reasoning

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment added

@srene srene requested a review from omritoptix August 27, 2024 16:14
block/manager.go Outdated
@@ -285,6 +287,15 @@ func (m *Manager) UpdateTargetHeight(h uint64) {
break
}
}

// updating LastGossipedHeight to avoid gossiping long queue of blocks when sequencer rotation
if m.State.LastGossipedHeight < h {
Copy link
Contributor

@omritoptix omritoptix Aug 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 issues I have with that:

  1. the function is called update target height, but it actually also updates last gossiped height
  2. the param is called last gossiped height for a full node, but it's tracking actually a potential sequencer only metric even if you're not a sequencer which may never be relevant for you. so maybe we need to make it a bit more make sense cause it feels hacky now (as if one would see that he would think it's the last height the full node got from gossiping but it's not)
  3. we call the store save state on each iteration vs just updating it once. not sure why we need to iterate vs just saving h? it's just overriding in a loop.

Copy link
Contributor Author

@srene srene Aug 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in order to avoid these issues, i refactored the pr and i implemented a new and simpler approach.
the new approach relies the block is added to blocksync after is produced but not on gossip (this is not removed but it will be used only by full-nodes). this way if the block is not gossiped because a panic, it will be available through blocksync, and full-nodes will be able to sync via p2p anyway

@srene srene force-pushed the srene/1044-fix-block-not-gossiped-after-crash branch from fb1bb8c to 33a5d63 Compare August 28, 2024 09:37
@srene srene requested a review from omritoptix August 28, 2024 09:57
block/block.go Outdated
@@ -48,6 +50,16 @@ func (m *Manager) applyBlock(block *types.Block, commit *types.Commit, blockMeta
return fmt.Errorf("save block: %w", err)
}

gossipedBlock := p2p.BlockData{Block: *block, Commit: *commit}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for cleanliness I'd suggest moving this code part to a saveP2PBlock method or some other name which makes sense an add a godoc to this function.

also we why use context.TODO()? s

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved to new func

@@ -519,6 +519,11 @@ func (c *Client) blockGossipReceived(ctx context.Context, block []byte) {
c.logger.Error("Publishing event.", "err", err)
}
if c.conf.BlockSyncEnabled {
_, err := c.store.LoadBlockCid(gossipedBlock.Block.Header.Height)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this for optimization purposes?
if so I see we're missing the AddBlockRecieved method in that case - is that fine?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, this is just to avoid storing the block twice if already stored.

@omritoptix omritoptix merged commit e9709f6 into main Aug 28, 2024
6 checks passed
@omritoptix omritoptix deleted the srene/1044-fix-block-not-gossiped-after-crash branch August 28, 2024 11:54
# for free to join this conversation on GitHub. Already have an account? # to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

block is not found using blocksync after crash
2 participants