Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Add term check on the internal streams, to avoid wrong closures of leader controller #513

Merged
merged 3 commits into from
Sep 14, 2024

Conversation

merlimat
Copy link
Collaborator

During a leader election, an old leader can force back the new leader into a follower mode.

This does not affect the correctness (eg: the old leader's requests are then ignored because the term is invalid), though it causes the shard to be unavailable until a new leader election is triggered.

The simplified sequence is like:

  1. oxia-1 is the old leader in term: 4
  2. Coordinator initiates a new leader election and fences all the nodes
  3. oxia-0 and oxia-2 are fenced correctly, while oxia-1 is ignored, and will be retried in background
  4. oxia-2 is elected leader (term: 5)
  5. oxia-1 is still running the leader controller (because it never got successfully fenced), and it keeps trying to replicated to oxia-0 and oxia-2 .

The replicate requests keep failing (because the term has already changed to 5), though the first request has the effect of putting oxia-2 back in follower mode.

We are now in a state where coordinator is happy thinking oxia-2 as the leader, though oxia-2 went back to follower mode. A new leader election is not triggered because oxia-2 looks healthy overall (the health check is done at the pod level, not at shard level).

@merlimat
Copy link
Collaborator Author

merlimat commented Sep 13, 2024

Coordinator logs

Sep 10 22:31:48.880 │INF│ Detected failure on shard leader component=shard-controller leader.public=oxia-1:6648 leader.internal=oxia-1:6649 namespace=default shard=39
Sep 10 22:31:48.880 │INF│ Starting leader election component=shard-controller namespace=default shard=39 term=5
Sep 10 22:31:55.306 │INF│ Processed newTerm response component=shard-controller entry-id.term="4" entry-id.offset="1659" namespace=default server-address.public=oxia-0:6648 server-address.internal=oxia-0:6649 shard=39
Sep 10 22:31:55.307 │INF│ Processed newTerm response component=shard-controller entry-id.term="4" entry-id.offset="1659" namespace=default server-address.public=oxia-2:6648 server-address.internal=oxia-2:6649 shard=39
Sep 10 22:31:55.407 │INF│ Successfully moved ensemble to a new term component=shard-controller followers=[{ server-address={ public=oxia-0:6648 internal=oxia-0:6649 } entry-id={ term=4 offset=1659 } }] namespace=default new-leader.public=oxia-2:6648 new-leader.internal=oxia-2:6649 shard=39 term=5
Sep 10 22:31:55.408 │WRN│ Failed to newTerm node component=shard-controller error.error="rpc error: code = Canceled desc = context canceled" error.kind=*status.Error error.stack=null namespace=default node=oxia-1:6649 shard=39
Sep 10 22:32:18.897 │INF│ Elected new leader component=shard-controller leader.public=oxia-2:6648 leader.internal=oxia-2:6649 namespace=default shard=39 term=5
Sep 10 22:32:18.897 │INF│ Node has failed in leader election, retrying component=shard-controller follower.public=oxia-1:6648 follower.internal=oxia-1:6649 namespace=default shard=39
Sep 10 22:32:19.081 │WRN│ Failed to newTerm, retrying later component=shard-controller error.error="rpc error: code = Code(106) desc = node is not leader for shard 39" error.kind=*status.Error error.stack=null follower.public=oxia-1:6648 follower.internal=oxia-1:6649 namespace=default retry-after=1464.45687 shard=39 term=5
Sep 10 22:32:21.137 │WRN│ Failed to newTerm, retrying later component=shard-controller error.error="rpc error: code = Code(106) desc = node is not leader for shard 39" error.kind=*status.Error error.stack=null follower.public=oxia-1:6648 follower.internal=oxia-1:6649 namespace=default retry-after=1567.098543 shard=39 term=5
Sep 10 22:32:22.952 │WRN│ Failed to newTerm, retrying later component=shard-controller error.error="rpc error: code = Code(106) desc = node is not leader for shard 39" error.kind=*status.Error error.stack=null follower.public=oxia-1:6648 follower.internal=oxia-1:6649 namespace=default retry-after=2313.475213 shard=39 term=5
Sep 10 22:32:25.318 │WRN│ Failed to newTerm, retrying later component=shard-controller error.error="rpc error: code = Code(106) desc = node is not leader for shard 39" error.kind=*status.Error error.stack=null follower.public=oxia-1:6648 follower.internal=oxia-1:6649 namespace=default retry-after=2368.899588 shard=39 term=5
Sep 10 22:32:28.876 │WRN│ Failed to newTerm, retrying later component=shard-controller error.error="rpc error: code = Code(106) desc = node is not leader for shard 39" error.kind=*status.Error error.stack=null follower.public=oxia-1:6648 follower.internal=oxia-1:6649 namespace=default retry-after=4971.413786 shard=39 term=5
Sep 10 22:32:33.982 │WRN│ Failed to newTerm, retrying later component=shard-controller error.error="rpc error: code = Code(106) desc = node is not leader for shard 39" error.kind=*status.Error error.stack=null follower.public=oxia-1:6648 follower.internal=oxia-1:6649 namespace=default retry-after=11369.749879 shard=39 term=5
Sep 10 22:32:45.871 │WRN│ Failed to newTerm, retrying later component=shard-controller error.error="rpc error: code = Code(106) desc = node is not leader for shard 39" error.kind=*status.Error error.stack=null follower.public=oxia-1:6648 follower.internal=oxia-1:6649 namespace=default retry-after=11811.46189 shard=39 term=5
Sep 10 22:32:58.060 │WRN│ Failed to newTerm, retrying later component=shard-controller error.error="rpc error: code = Code(106) desc = node is not leader for shard 39" error.kind=*status.Error error.stack=null follower.public=oxia-1:6648 follower.internal=oxia-1:6649 namespace=default retry-after=18758.572051 shard=39 term=5

Oxia-1 (old-leader)

Sep 10 22:31:55.438 │INF│ Ack stream finished component=follower-cursor follower=oxia-2:6649 namespace=default shard=39 term=4
Sep 10 22:31:55.438 │INF│ Ack stream finished component=follower-cursor follower=oxia-0:6649 namespace=default shard=39 term=4
Sep 10 22:32:03.005 │ERR│ Error while pushing entries to follower component=follower-cursor error.error=EOF error.kind=*errors.errorString error.stack=null follower=oxia-2:6649 namespace=default retry-after=64.806002 shard=39 term=4
Sep 10 22:32:03.024 │ERR│ Error while pushing entries to follower component=follower-cursor error.error=EOF error.kind=*errors.errorString error.stack=null follower=oxia-0:6649 namespace=default retry-after=78.797359 shard=39 term=4
Sep 10 22:32:03.104 │INF│ Successfully attached cursor follower ack-offset=1659 component=follower-cursor follower=oxia-2:6649 namespace=default shard=39 term=4
Sep 10 22:32:03.124 │INF│ Successfully attached cursor follower ack-offset=1659 component=follower-cursor follower=oxia-0:6649 namespace=default shard=39 term=4
Sep 10 22:32:03.234 │WRN│ Error while receiving acks component=follower-cursor error.error="rpc error: code = Code(101) desc = oxia: invalid term" error.kind=*status.Error error.stack=null follower=oxia-2:6649 namespace=default shard=39 term=4
Sep 10 22:32:03.234 │ERR│ Error while pushing entries to follower component=follower-cursor error.error="context canceled" error.kind=*errors.errorString error.stack=null follower=oxia-2:6649 namespace=default retry-after=144.778157 shard=39 term=4

Oxia-2 (new leader)

Sep 10 22:31:55.296 │INF│ Received NewTerm request component=internal-rpc-server peer=192.168.8.6:57406 req.namespace=default req.shardId="39" req.term="5"
Sep 10 22:31:55.296 │INF│ Found follower, initiating new term component=internal-rpc-server followerTerm=4 peer=192.168.8.6:57406 req.namespace=default req.shardId="39" req.term="5"
Sep 10 22:31:55.306 │INF│ Follower successfully initialized in new term component=follower-controller last-entry.term="4" last-entry.offset="1659" namespace=default shard=39 term=5
Sep 10 22:31:55.408 │INF│ Received BecomeLeader request component=internal-rpc-server peer=192.168.8.6:57406 request.namespace=default request.shardId="39" request.term="5" request.replicationFactor=3 request.followerMaps.oxia-0:6649.term="4" request.followerMaps.oxia-0:6649.offset="1659"
Sep 10 22:31:55.416 │INF│ [JOB 1] WAL file /data/db/default/shard-39/000023.log with log number 000023 stopped reading at offset: 0; replayed 0 keys in 0 batches component=pebble shard=39
Sep 10 22:31:55.424 │INF│ Created leader controller component=leader-controller namespace=default shard=39 term=5
Sep 10 22:31:55.425 │INF│ Added follower component=leader-controller follower=oxia-0:6649 follower-head-entry.term="4" follower-head-entry.offset="1659" head-offset=1659 leader-election-head-entry.term="4" leader-election-head-entry.offset="1659" namespace=default shard=39 term=5
Sep 10 22:31:55.425 │INF│ Applying all pending entries to database commit-offset=1658 component=leader-controller head-offset=1659 namespace=default shard=39 term=5
Sep 10 22:31:55.425 │INF│ Successfully attached cursor follower ack-offset=1659 component=follower-cursor follower=oxia-0:6649 namespace=default shard=39 term=5
Sep 10 22:31:55.426 │INF│ All sessions component=session-manager count=3 namespace=default shard=39 term=5
Sep 10 22:31:55.426 │INF│ Session started client-identity=8d58549d-1351-4476-adab-611825eecdda component=session namespace=default session-id=750 session-timeout=30000 shard=39
Sep 10 22:31:55.426 │INF│ Session started client-identity=d101e0b7-0ad4-4803-9d21-a708a1247d44 component=session namespace=default session-id=771 session-timeout=30000 shard=39
Sep 10 22:31:55.426 │INF│ Session started client-identity=21ea4496-d839-4ed8-8f54-7dd2612621de component=session namespace=default session-id=742 session-timeout=30000 shard=39
Sep 10 22:31:55.426 │INF│ Started leading the shard component=leader-controller head-offset=1659 namespace=default shard=39 term=5
Sep 10 22:32:03.125 │INF│ Received Replicate request component=internal-rpc-server namespace=default peer=192.168.9.20:37408 shard=39
Sep 10 22:32:03.125 │INF│ Closing leader controller component=leader-controller namespace=default shard=39 term=5

The replicate request (in term 4) from oxia-1, caused the closing of leader controller (in term 5) and switch back to follower mode in oxia-2.

@merlimat merlimat merged commit 1934d55 into streamnative:main Sep 14, 2024
5 checks passed
@merlimat merlimat deleted the stream-term-check branch September 14, 2024 00:54
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant