raft: Fix possible deadlocks #1537
Merged
Conversation
raft: Fix race that leads to raft deadlock

PR moby#1310 ("Fix infinite election loop") solved a problem with election loops on startup by delaying new proposals until the leader has committed all its existing entries. This ensures that the state machine doesn't call ApplyStoreActions to commit a previous entry from the log while a new proposal is in process, since they both acquire a write lock over the memory store, which would deadlock.

Unfortunately, there is still a race condition which can lead to a similar deadlock. processInternalRaftRequest makes sure that proposals aren't started after the manager loses its status as the leader by first registering a wait for the raft request, then checking the leadership status. If the leadership status is lost before calling register(), then the leadership check should fail, since it happens afterwards. Conversely, if the leadership status is lost after calling register(), then cancelAll() in Run() will make sure this wait gets cancelled.

The problem with this is that the new code in PR moby#1310 calls cancelAll() *before* setting the leadership status. So it's possible that first we cancel all outstanding requests, then a new request is registered and successfully checks that we are still the leader, then we set leader to "false". This request never gets cancelled, so it causes a deadlock: nothing can be committed to the store until this request goes through, but it can't go through if we're not the leader anymore.

To fix this, swap the order of cancelAll so it happens after we change the leadership status variable. This means that no matter how the goroutines are interleaved, a new request will either cancel itself or be cancelled by Run when leadership is lost. I'm aware that this is ugly, and I'm open to suggestions for refactoring or abstracting.

Also, out of extra caution, call cancelAll in the situation which would lead to a deadlock if there were any outstanding raft requests.

Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
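To make the interleaving argument concrete, here is a minimal, self-contained Go sketch of the register-then-check pattern and the corrected ordering. All of the names (waitRegistry, node, isLeader, loseLeadership, propose) are illustrative stand-ins rather than the actual swarmkit types; the only point is the order of the leadership-flag update relative to cancelAll.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"sync/atomic"
)

// waitRegistry is a hypothetical stand-in for the wait map that tracks
// outstanding proposals so they can be cancelled when leadership is lost.
type waitRegistry struct {
	mu      sync.Mutex
	cancels map[uint64]func()
}

func (w *waitRegistry) register(id uint64, cancel func()) {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.cancels == nil {
		w.cancels = make(map[uint64]func())
	}
	w.cancels[id] = cancel
}

func (w *waitRegistry) cancel(id uint64) {
	w.mu.Lock()
	defer w.mu.Unlock()
	if c, ok := w.cancels[id]; ok {
		c()
		delete(w.cancels, id)
	}
}

func (w *waitRegistry) cancelAll() {
	w.mu.Lock()
	defer w.mu.Unlock()
	for id, c := range w.cancels {
		c()
		delete(w.cancels, id)
	}
}

var errLostLeadership = errors.New("lost leadership")

type node struct {
	isLeader uint32 // 1 while this node believes it is the leader
	wait     waitRegistry
}

// loseLeadership applies the fix: update the leadership flag first, then
// cancel outstanding proposals. Any proposal that registers after cancelAll
// runs will see isLeader == 0 and cancel itself.
func (n *node) loseLeadership() {
	atomic.StoreUint32(&n.isLeader, 0)
	n.wait.cancelAll()
}

// propose mirrors the register-then-check pattern: register the wait first,
// then check leadership, so every interleaving is covered by one of the two
// cancellation paths.
func (n *node) propose(id uint64, cancel func()) error {
	n.wait.register(id, cancel)
	if atomic.LoadUint32(&n.isLeader) != 1 {
		n.wait.cancel(id) // registered too late: clean up ourselves
		return errLostLeadership
	}
	// Registered while still leader: if leadership is lost afterwards,
	// loseLeadership's cancelAll will cancel this wait.
	return nil
}

func main() {
	n := &node{isLeader: 1}
	if err := n.propose(1, func() { fmt.Println("proposal 1 cancelled") }); err != nil {
		fmt.Println(err)
	}
	n.loseLeadership() // cancels proposal 1
	if err := n.propose(2, func() { fmt.Println("proposal 2 cancelled") }); err != nil {
		fmt.Println(err) // proposal 2 cancels itself
	}
}
```

Running this prints that proposal 1 is cancelled by loseLeadership while proposal 2 cancels itself, which is the "either cancel itself or be cancelled by Run" guarantee described above.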
raft: If no longer leader, cancel proposals before calling processCommitted

This was being done after processCommitted, which could cause a deadlock (if not for the cautionary cancelAll call added to processEntry in the previous commit).

Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
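Below is a small hedged sketch of that ordering in a Run-style loop. The ready struct and the cancelAll/processCommitted methods here are simplified stand-ins, not the real etcd/raft Ready handling; they only illustrate why proposals are cancelled before committed entries are applied.

```go
package main

import "fmt"

// ready is a hypothetical, condensed view of one pass through the Run loop;
// the real code works on an etcd/raft Ready value.
type ready struct {
	lostLeadership   bool
	committedEntries []string
}

type node struct {
	pendingCancels []func() // cancel callbacks for outstanding proposals
}

func (n *node) cancelAll() {
	for _, c := range n.pendingCancels {
		c()
	}
	n.pendingCancels = nil
}

func (n *node) processCommitted(entries []string) {
	for _, e := range entries {
		// Applying an entry takes the store's write lock in the real code,
		// which is why a never-cancelled proposal here would deadlock.
		fmt.Println("applying committed entry:", e)
	}
}

// handleReady applies the ordering from this commit: on losing leadership,
// cancel outstanding proposals *before* processing committed entries.
func (n *node) handleReady(rd ready) {
	if rd.lostLeadership {
		n.cancelAll()
	}
	n.processCommitted(rd.committedEntries)
}

func main() {
	n := &node{pendingCancels: []func(){func() { fmt.Println("outstanding proposal cancelled") }}}
	n.handleReady(ready{lostLeadership: true, committedEntries: []string{"entry-7"}})
}
```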
raft: Fix possible deadlock in processInternalRaftRequest

This function calls Propose, which will block if there's no leader. Suppose we lose leadership just before calling Propose, and now there's no leader. processInternalRaftRequest will block until a new leader is elected, but this may interfere with the leader election. If Run receives a Ready value that has new items to commit to the store, it will try to do that, and get stuck there because processInternalRaftRequest is called by a store function that has the write lock held. Then the Run goroutine will get stuck, and not be able to send outgoing messages.

To solve this, create a context with a cancel function for processInternalRaftRequest, and make it so that cancelling the wait calls this cancel function and unblocks Propose.

Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
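Here is a minimal sketch of that cancellation wiring, assuming a hypothetical proposer interface and a noLeader stub in place of the real raft Propose; only context.WithCancel comes from the standard library, everything else is made up for the example.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// proposer abstracts the Propose call, which blocks until the proposal is
// accepted or its context is cancelled.
type proposer interface {
	Propose(ctx context.Context, data []byte) error
}

// noLeader simulates the "no leader" case: the proposal can never be
// accepted, so Propose blocks until its context is cancelled.
type noLeader struct{}

func (noLeader) Propose(ctx context.Context, data []byte) error {
	<-ctx.Done()
	return ctx.Err()
}

// proposeWithCancellableWait sketches the fix: derive a cancellable context,
// hand its cancel function to whatever tracks the wait, and pass the context
// to Propose so cancelling the wait also unblocks Propose.
func proposeWithCancellableWait(parent context.Context, p proposer, data []byte, registerWait func(cancel func())) error {
	waitCtx, cancel := context.WithCancel(parent)
	defer cancel()

	// Cancelling this wait (e.g. via cancelAll when leadership is lost)
	// now also cancels waitCtx instead of leaving Propose blocked forever.
	registerWait(cancel)

	return p.Propose(waitCtx, data)
}

func main() {
	cancelCh := make(chan func(), 1)

	// Simulate the Run goroutine cancelling the wait shortly after the
	// proposal starts, as cancelAll would on leadership loss.
	go func() {
		cancel := <-cancelCh
		time.Sleep(100 * time.Millisecond)
		cancel()
	}()

	err := proposeWithCancellableWait(context.Background(), noLeader{}, []byte("proposal"),
		func(c func()) { cancelCh <- c })
	fmt.Println("Propose returned:", err) // context.Canceled instead of deadlocking
}
```

Cancelling the registered wait makes Propose return instead of blocking forever, so the store's write lock is released and the Run goroutine can keep making progress.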
@aaronlehmann Nice, thanks! Will try with my tests.
Current coverage is 53.84% (diff: 81.48%)

@@             master    #1537   diff @@
==========================================
  Files            82       82
  Lines         13414    13415     +1
  Methods           0        0
  Messages          0        0
  Branches          0        0
==========================================
+ Hits           7204     7223    +19
+ Misses         5224     5209    -15
+ Partials        986      983     -3
LK4D4
approved these changes
Sep 14, 2016
Code makes sense to me. I've run basic tests and they work perfectly.
mrjana
approved these changes
Sep 14, 2016
LGTM
This includes three commits that fix possible deadlocks between ApplyStoreActions and processInternalRaftRequest:

raft: Fix race that leads to raft deadlock
PR #1310 ("Fix infinite election loop") solved a problem with election
loops on startup, by delaying new proposals until the leader has
committed all its existing entries. This ensures that the state machine
doesn't call
ApplyStoreActions
to commit a previous entry from the logwhile an new proposal is in process - since they both acquire a write
lock over the memory store, which would deadlock.
Unfortunately, there is still a race condition which can lead to a
similar deadlock.
processInternalRaftRequest
makes sure that proposalsarent't started after the manager loses its status as the leader by
first registering a wait for the raft request, then checking the
leadership status. If the leadership status is lost before calling
register()
, then the leadership check should fail, since it happensafterwards. Conversely, if the leadership status is lost after calling
register()
, thencancelAll()
inRun()
will make sure this wait getscancelled.
The problem with this is that the new code in PR #1310 calls
cancelAll()
before setting the leadership status. So it's possible that first we
cancel all outstanding requests, then a new request is registered and
successfully checks that we are still the leader, then we set leader to
"false". This request never gets cancelled, so it causes a deadlock.
Nothing can be committed to the store until this request goes through,
but it can't go through if we're not the leader anymore.
To fix this, swap the order of
cancelAll
so it happens after we changethe leadership status variable. This means that no matter how the
goroutines are interleaved, a new request will either cancel itself or
be cancelled by
Run
when leadership is lost. I'm aware that this is uglyand I'm open to suggestions for refactoring or abstracting.
Also, out of extra caution, call
cancelAll
in the situation which wouldlead to a deadlock if there were any outstanding raft requests.
raft: If no longer leader, cancel proposals before calling processCommitted
raft: Fix possible deadlock in processInternalRaftRequest
cc @LK4D4