Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Sync Issues Encountered During Testing Gossipsub Effectiveness measuring #11136

Open
6 of 11 tasks
TippyFlitsUK opened this issue Aug 4, 2023 · 2 comments
Open
6 of 11 tasks
Labels
area/chain Area: Chain kind/bug Kind: Bug need/team-input Hint: Needs Team Input P1 P1: Must be resolved

Comments

@TippyFlitsUK
Copy link
Contributor

Checklist

  • This is not a security-related bug/issue. If it is, please follow please follow the security policy.
  • I have searched on the issue tracker and the lotus forum, and there is no existing related issue or discussion.
  • I am running the Latest release, the most recent RC(release canadiate) for the upcoming release or the dev branch(master), or have an issue updating to any of these.
  • I did not make any code changes to lotus.

Lotus component

  • lotus daemon - chain sync
  • lotus fvm/fevm - Lotus FVM and FEVM interactions
  • lotus miner/worker - sealing
  • lotus miner - proving(WindowPoSt/WinningPoSt)
  • lotus JSON-RPC API
  • lotus message management (mpool)
  • Other

Lotus Version

Daemon:  1.23.4-dev+mainnet+git.d082940d5+api1.5.0
Local: lotus version 1.23.4-dev+mainnet+git.d082940d5

Repro Steps

  1. Enable ElasticSearch tracing in the daemon config.toml
  2. Start daemon and miner and allow to run

Describe the Bug

#11118 aims to measure the effectiveness of Gossibsub across a minimum of 10% of active mainnet nodes.

Testing on a daemon-only node did not appear to cause any issues but significant sync loss was observed when running the tracer on a node connected to a lotus-miner instance.

Slack reference: https://filecoinproject.slack.com/archives/CP50PPW2X/p1690803449761209

The daemon doesn't "appear" to be out of sync but the miner is repeatedly falling behind with a declining baseDeltaSeconds which should ideally be a consistent and repeating value of 10.

This is similar behaviour to the sync issues experienced in the now-resolved #10906 which was also connected to Gossipsub., however, it doesn't appear to be triggered by pending mpool messages in this instance. This is effectively rendering an SP operation completely inoperable!

No sealing or proving activities were active on the miner at time of testing but both Lotus Slasher & Lotus Disputer were active on the node in question.

Daemon and miner logs are attached below.

Screenshot 2023-08-01 233516

Screenshot 2023-08-02 001057

lotus-daemon (copy).log
lotus-miner (copy).log

Logging Information

2023-08-01T23:37:10.026+0100	WARN	miner	miner/miner.go:482	completed mineOne	{"tookMilliseconds": 8, "forRound": 3087425, "baseEpoch": 3087423, "baseDeltaSeconds": 340, "nullRounds": 1, "lateStart": true, "beaconEpoch": 3183270, "lookbackEpochs": 900, "networkPowerAtLookback": "26388167536392765440", "minerPowerAtLookback": "1349032674983936", "isEligible": true, "isWinner": false, "error": null}
2023-08-01T23:37:10.038+0100	WARN	miner	miner/miner.go:482	completed mineOne	{"tookMilliseconds": 8, "forRound": 3087426, "baseEpoch": 3087423, "baseDeltaSeconds": 340, "nullRounds": 2, "lateStart": true, "beaconEpoch": 3183271, "lookbackEpochs": 900, "networkPowerAtLookback": "26388167845630410752", "minerPowerAtLookback": "1349032674983936", "isEligible": true, "isWinner": false, "error": null}
2023-08-01T23:37:10.048+0100	WARN	miner	miner/miner.go:482	completed mineOne	{"tookMilliseconds": 7, "forRound": 3087427, "baseEpoch": 3087423, "baseDeltaSeconds": 340, "nullRounds": 3, "lateStart": true, "beaconEpoch": 3183272, "lookbackEpochs": 900, "networkPowerAtLookback": "26388234503522844672", "minerPowerAtLookback": "1349032674983936", "isEligible": true, "isWinner": false, "error": null}
2023-08-01T23:37:10.061+0100	WARN	miner	miner/miner.go:482	completed mineOne	{"tookMilliseconds": 8, "forRound": 3087428, "baseEpoch": 3087423, "baseDeltaSeconds": 340, "nullRounds": 4, "lateStart": true, "beaconEpoch": 3183273, "lookbackEpochs": 900, "networkPowerAtLookback": "26388234503522844672", "minerPowerAtLookback": "1349032674983936", "isEligible": true, "isWinner": false, "error": null}
2023-08-01T23:37:10.072+0100	WARN	miner	miner/miner.go:482	completed mineOne	{"tookMilliseconds": 8, "forRound": 3087429, "baseEpoch": 3087423, "baseDeltaSeconds": 340, "nullRounds": 5, "lateStart": true, "beaconEpoch": 3183274, "lookbackEpochs": 900, "networkPowerAtLookback": "26388233407656591360", "minerPowerAtLookback": "1349032674983936", "isEligible": true, "isWinner": false, "error": null}
2023-08-01T23:37:10.084+0100	WARN	miner	miner/miner.go:482	completed mineOne	{"tookMilliseconds": 7, "forRound": 3087430, "baseEpoch": 3087423, "baseDeltaSeconds": 340, "nullRounds": 6, "lateStart": true, "beaconEpoch": 3183275, "lookbackEpochs": 900, "networkPowerAtLookback": "26388234026131881984", "minerPowerAtLookback": "1349032674983936", "isEligible": true, "isWinner": false, "error": null}
@TippyFlitsUK TippyFlitsUK added need/triage kind/bug Kind: Bug need/team-input Hint: Needs Team Input area/chain Area: Chain P1 P1: Must be resolved and removed need/triage labels Aug 4, 2023
@TippyFlitsUK
Copy link
Contributor Author

Update on further testing carried out by @Reiers:

I didn’t fall out sync - but messages started to hang in mpool for some reason. Resulting windowPosts loss. I needed to return back to a stable branch for recovery.
I will start to :magnifying_glass: through the logs tmr. And continue to test. I’ll keep you guys posted of my findings.
The good news is that this isn’t necessarily tied to this testing, since I didn’t fall out of sync during the entire run time (10 hours)

https://filecoinproject.slack.com/archives/CP50PPW2X/p1691097934171129?thread_ts=1690803449.761209&cid=CP50PPW2X

@TippyFlitsUK
Copy link
Contributor Author

Further testing today produced similar results:

Still losing sync, this time after about 60 minutes. Disabling the tracer settings fully restored sync again.

  • Slasher & Disputer - both disabled
  • No sealing or proving jobs underway, system under minimal load
  • Dual Epyc 2442 CPUs
  • 1 TiB ECC RAM

Daemon logs - https://drive.google.com/drive/folders/1IdFNglM5IUcjRr3UtnroUAMpvLjpetJN?usp=drive_link

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
area/chain Area: Chain kind/bug Kind: Bug need/team-input Hint: Needs Team Input P1 P1: Must be resolved
Projects
None yet
Development

No branches or pull requests

1 participant