Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

refreshes ContactInfo.outset before initializing validator #3135

Merged

Conversation

behzadnouri
Copy link

Problem

Nodes join gossip during bootstrap process with a stub contact-info which in particular has invalid TVU socket address.

Once the bootstrap is done they re-join gossip a 2nd time with a fully populated contact-info, but this contact-info has an outset timestamp older than the 1st one because it was initiated earlier.

In v2.0, the outset timestamp determines which contact-info overrides the other, so the v2.0 nodes refrain from updating their CRDS table with the fully initialized contact-info.

Summary of Changes

The commit refreshes ContactInfo.outset before initializing validator so that it overrides the one pushed to gossip by the bootstrap stage.

@behzadnouri behzadnouri requested a review from a team as a code owner October 10, 2024 19:16
Nodes join gossip during bootstrap process with a stub contact-info
which in particular has invalid TVU socket address.

Once the bootstrap is done they re-join gossip a 2nd time with a fully
populated contact-info, but this contact-info has an outset timestamp
older than the 1st one because it was initiated earlier.

In v2.0 the outset timestamp determines which contact-info overrides the
other, so the v2.0 nodes refrain from updating their CRDS table with the
fully initialized contact-info.

The commit refreshes ContactInfo.outset before initializing the
validator so that it overrides the one pushed to the gossip by the
bootstrap stage.
@willhickey
Copy link

Hey @steviez @gregcusack, we'll need approval from a subject matter expert in addition to the backport reviewers.

@gregcusack
Copy link

curious what the testing process for this will be if we backport a commit to v1.18 which has been running on mainnet for some time now.

@behzadnouri
Copy link
Author

curious what the testing process for this will be if we backport a commit to v1.18 which has been running on mainnet for some time now.

I hope we get enough soak time on testnet and mainnet before proceeding with v2.0 upgrade.

@willhickey
Copy link

willhickey commented Oct 10, 2024

I hope we get enough soak time on testnet and mainnet before proceeding with v2.0 upgrade.

We can adjust the upgrade schedule as needed. How much time would you want this to run on each cluster?

The original plan was to do some quick upgrade/downgrade cycles on testnet, and then start ramping v2.0 on mainnet-beta. Something like this:

Testnet, over the course of ~2-4 days:

  1. Restart on v1.18
  2. Upgrade live to v2.0
  3. Downgrade live to v1.18
  4. Upgrade to v2.0

Mainnet-beta:

  1. Ask for volunteers to take 10% of stake to v2.0
  2. 1 week after 1: ask for volunteers to take 25% of stake to v2.0
  3. 1 week after 2: Recommend v2.0 for general adoption

If we publish a new v1.18 release with this change most mainnet-beta operators will ignore it unless we tell them it's a critical patch. As I understand it, this problem only manifests if there's a restart while the cluster is split v1.18 / v2.0. So we could recommend v1.18.26, for mainnet-beta, but only push it as a critical patch if we happen to have a restart during the upgrade. Is that correct?

@behzadnouri
Copy link
Author

Without this patch, if some v1.18 node decides to restart and runs the bootstrap code, then v2.0 nodes will pick up its stub contact-info from bootstrap and stick to it, in which case the node will be effectively left out of cluster because the stub contact-info only has a gossip address.

So I would say it is still much safer if nodes upgrade to this code before more stake is running v2.0 branch.

@gregcusack
Copy link

been thinking if this is going to cause any issues for v1.18. I don't think it will.

v1.18 uses ContactInfo for startup in rpc_bootstrap(), but then really relies on LegacyContactInfo for communication with the rest of the cluster. It's really just sending and accepting ContactInfo in the background.

With the upgrade to v2, now v1.18 needs ContactInfo to communicate with v2, since v2 doesn't use LegacyContactInfo for much. v2 has an override check that v1.18 does not have that looks at ContactInfo.outset, which is why if a v2 node receives an outset older than the current one, the new value will fail to override.

Copy link

@gregcusack gregcusack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

Copy link

@steviez steviez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just about identical to #2666 when you control for the addition of the helpers

@willhickey
Copy link

Devnet testing succeeded:
https://discord.com/channels/428295358100013066/1293704291437121537/1294175307162451988

The v2.0 version of this patch (#2681) went into v2.0.7 which was announced for testnet on 2024-08-25, and then used for the testnet restart on 2024-08-26

@anza-xyz/backport-reviewers Any other testing you'd like to see here?

Copy link

@t-nelson t-nelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as per discord discussion, lgtm

@behzadnouri behzadnouri merged commit c2b3500 into anza-xyz:v1.18 Oct 11, 2024
33 checks passed
@behzadnouri behzadnouri deleted the patch-contact-info-outset-v1.18 branch October 11, 2024 16:31
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants