Nomad Server: fatal error: concurrent map read and map write #4607
Comments
We are seeing the same issue, and we still have the problem described in #4477: simple restarts hang the Nomad servers once they get stuck. Here are our logs. But the patch you provided doesn't look right to us:

```diff
diff --git a/nomad/eval_broker.go b/nomad/eval_broker.go
index 54e32e7a8..59fde90dc 100644
--- a/nomad/eval_broker.go
+++ b/nomad/eval_broker.go
@@ -763,8 +763,8 @@ func (b *EvalBroker) runDelayedEvalsWatcher(ctx context.Context) {
 			return
 		case <-timerChannel:
 			// remove from the heap since we can enqueue it now
-			b.delayHeap.Remove(&evalWrapper{eval})
 			b.l.Lock()
+			b.delayHeap.Remove(&evalWrapper{eval})
 			b.stats.TotalWaiting -= 1
 			b.enqueueLocked(eval, eval.Type)
 			b.l.Unlock()
@@ -777,11 +777,14 @@ func (b *EvalBroker) runDelayedEvalsWatcher(ctx context.Context) {
 // nextDelayedEval returns the next delayed eval to launch and when it should be enqueued.
 // This peeks at the heap to return the top. If the heap is empty, this returns nil and zero time.
 func (b *EvalBroker) nextDelayedEval() (*structs.Evaluation, time.Time) {
+	b.l.RLock()
 	// If there is nothing wait for an update.
 	if b.delayHeap.Length() == 0 {
+		b.l.RUnlock()
 		return nil, time.Time{}
 	}
 	nextEval := b.delayHeap.Peek()
+	b.l.RUnlock()
 	if nextEval == nil {
 		return nil, time.Time{}
```

It would be very helpful if someone from HashiCorp could clarify this point. |
my patch made the cluster non-crashy, so it did what i wanted it to :) |
@preetapan can you or someone on the nomad team please triage this? |
Hi,

Nomad version:
Operating system and Environment details
|
Sorry for the late response, been traveling this week. This is on our radar and I'll have a PR up for it by early next week. |
@preetapan thanks for the info. Any suggestions for a hotfix? Server restart or something? |
@pkrolikowski yes, restarting the server should work. Draining several nodes at once increases the chance of the race condition occurring again, so I would suggest avoiding that until I merge a fix to master. Sorry for the bug. |
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. |
Nomad version
Nomad v0.8.4 (dbee1d7d051619e90a809c23cf7e55750900742a)
Operating system and Environment details
Issue
Nomad server panics with
fatal error: concurrent map read and map write
- It happened on 2 out of the 3 Nomad servers in the cluster
- Restarting the Nomad servers causes them to panic the same way after a short time
- Clients seem unaffected
- `go test -race -run 'TestEval*' ./nomad/` seems to output tons of data race issues in general

Reproduction steps
Was draining a bunch of nodes; not sure if it's related or not.
Nomad Server logs (if appropriate)
https://gist.github.com/jippi/7b41d871cdbda6e24d202e09f0925438
Hotfix