Fix abnormal node state when leader transfer fails #247

Closed
cserwen opened this issue Oct 28, 2022 · 0 comments · Fixed by #248

cserwen commented Oct 28, 2022

Question

We have three nodes in a DLedger cluster: n0, n1, and n2. n0 is the preferred leader.

  • Initially, n0 is the leader. A problem then occurs on the machine hosting n0, so n2 is elected as the new leader.
  • When n0 recovers, n2 tries to transfer leadership back to n0.
  • However, n0 does not respond to n2's transfer request in time:
2022-10-26 08:21:20 INFO NettyServerPublicExecutor_3 - [n0] [ChangeRoleToCandidate] from term: 56 and currTerm: 55
2022-10-26 08:22:15 INFO StateMaintainer - n0_[INCREASE_TERM] from 55 to 56

n0 received the transfer request at 08:21:20 but did not initiate the election until 08:22:15. The transfer request therefore failed, and n2 became writable again. By that time n0 was a candidate, and a candidate cannot synchronize data from the leader. As a result, n0's lag grew beyond 1000, and n2 no longer initiates a transfer request.

In short, n0 is stuck as a candidate, so it cannot synchronize data.
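The deadlock described above can be illustrated with a small check. This is only a sketch under assumptions: the class, method, and constant names are hypothetical and are not taken from the DLedger code base. Only the lag limit of 1000 and the fact that a candidate does not synchronize data come from the description above.

```java
// Hypothetical sketch; none of these names come from the DLedger source.
final class TransferGate {
    enum Role { LEADER, FOLLOWER, CANDIDATE }

    // Lag limit mentioned in this issue; the constant name is an assumption.
    static final long MAX_LAG_FOR_TRANSFER = 1000;

    /** The leader only attempts a transfer while the target's lag is within the limit. */
    static boolean leaderWillAttemptTransfer(long leaderEndIndex, long targetEndIndex) {
        return leaderEndIndex - targetEndIndex <= MAX_LAG_FOR_TRANSFER;
    }

    /** In the scenario above, a node stuck as candidate does not receive replicated entries. */
    static boolean canCatchUp(Role targetRole) {
        return targetRole == Role.FOLLOWER;
    }

    public static void main(String[] args) {
        long leaderEnd = 5000; // illustrative indexes, not from the logs
        long n0End = 3500;
        System.out.println("transfer attempted: " + leaderWillAttemptTransfer(leaderEnd, n0End));
        System.out.println("n0 can catch up:    " + canCatchUp(Role.CANDIDATE));
    }
}
```

With both checks returning false, neither side makes progress: the leader will not start a transfer, and the candidate cannot reduce its lag.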

Solution

There are two possible fixes:

  • n0 actively reverts to follower and rolls back its term.

However, a term only ever increases and never decreases, so rolling it back is not in line with the Raft paper, which defines currentTerm as:

latest term server has seen (initialized to 0 on first boot, increases monotonically)

The paper also says that when a candidate receives an append request from a leader and the candidate's currentTerm <= the leader's term, the candidate should become a follower:

While waiting for votes, a candidate may receive an AppendEntries RPC from another server claiming to be leader. If the leader’s term (included in its RPC) is at least as large as the candidate’s current term, then the candidate recognizes the leader as legitimate and returns to follower state. If the term in the RPC is smaller than the candidate’s current term, then the candidate rejects the RPC and continues in candidate state
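The rule quoted above can be sketched as follows. This is a minimal illustration of the behavior the paper describes, not DLedger's implementation; the class and method names are hypothetical, and the term numbers in `main` are illustrative rather than taken from the logs.

```java
// Minimal sketch of the candidate rule from the Raft paper; names are hypothetical.
final class CandidateRule {
    enum Role { FOLLOWER, CANDIDATE, LEADER }

    static final class Node {
        Role role = Role.CANDIDATE;
        long currentTerm;

        Node(long currentTerm) {
            this.currentTerm = currentTerm;
        }

        /** Returns true if the AppendEntries RPC is accepted. */
        boolean onAppendEntries(long leaderTerm) {
            if (leaderTerm < currentTerm) {
                return false; // stale leader: reject and remain a candidate
            }
            // leaderTerm >= currentTerm, so the term never moves backwards here.
            currentTerm = leaderTerm;
            role = Role.FOLLOWER;
            return true;
        }
    }

    public static void main(String[] args) {
        Node a = new Node(8);
        System.out.println(a.onAppendEntries(7) + " " + a.role); // rejected, still CANDIDATE
        Node b = new Node(8);
        System.out.println(b.onAppendEntries(8) + " " + b.role); // accepted, now FOLLOWER
    }
}
```

Note that the candidate only ever adopts a term that is equal to or larger than its own, which is why this rule does not require rolling a term back.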

  • The leader increases its term and becomes a candidate to initiate a new election. n0 then takes part in the vote as usual and returns to a normal state.

Section 5.1 of the paper states:

if one server’s current term is smaller than the other’s, then it updates its current term to the larger value.

Therefore, we can fix this issue with the second approach.
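A rough sketch of the second approach, assuming a simplified node model: all names here are hypothetical, the term numbers are illustrative, and vote counting and log comparison are left out. It is not the code from the linked fix; it only shows how a term bump on the leader lets the stuck node adopt a larger term and fall back to follower without any term ever decreasing.

```java
// Hypothetical model of the second approach; not the actual DLedger change.
final class TermCatchUp {
    enum Role { FOLLOWER, CANDIDATE, LEADER }

    static final class Node {
        Role role;
        long currentTerm;

        Node(Role role, long currentTerm) {
            this.role = role;
            this.currentTerm = currentTerm;
        }

        /** Proposed behavior: a leader whose transfer failed bumps its term and runs for election. */
        void onTransferFailed() {
            if (role != Role.LEADER) {
                return;
            }
            currentTerm++;
            role = Role.CANDIDATE;
        }

        /** Raft section 5.1: a server that sees a larger term adopts it and reverts to follower. */
        void observeTerm(long remoteTerm) {
            if (remoteTerm > currentTerm) {
                currentTerm = remoteTerm;
                role = Role.FOLLOWER;
            }
        }
    }

    public static void main(String[] args) {
        Node leader = new Node(Role.LEADER, 8);    // illustrative terms
        Node stuck = new Node(Role.CANDIDATE, 8);  // the node stuck as candidate

        leader.onTransferFailed();                 // leader starts a new election
        stuck.observeTerm(leader.currentTerm);     // stuck node sees the larger term

        System.out.println(stuck.role + " at term " + stuck.currentTerm);
    }
}
```

Once the stuck node is a follower again, it can receive entries from whichever node wins the new election.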

RongtongJin pushed a commit that referenced this issue Nov 17, 2022
Co-authored-by: dengzhiwen1 <dengzhiwen1@xiaomi.com>
humkum pushed a commit to humkum/dledger that referenced this issue Nov 20, 2023
Co-authored-by: dengzhiwen1 <dengzhiwen1@xiaomi.com>
yuz10 pushed a commit that referenced this issue Dec 16, 2024
Co-authored-by: dengzhiwen1 <dengzhiwen1@xiaomi.com>