
Run job deregistering in a single transaction #4861

Merged
merged 2 commits into master from b-batch-deregister-transaction
Nov 13, 2018

Conversation

notnoop
Contributor

@notnoop notnoop commented Nov 10, 2018

Fixes #4299

Upon investigating this case further, we determined the issue to be a race between applying the `JobBatchDeregisterRequest` FSM operation and processing job-deregister evals.

Processing job-deregister evals should wait, via the snapshot index, until the FSM log message finishes applying. However, with `JobBatchDeregister`, applying any single job deregistration in the batch accidentally incremented the snapshot index and triggered processing of the job-deregister evals. When a Nomad server receives an eval for a job in the batch that has yet to be deleted, we may accidentally re-run the job depending on the state of its allocations.

This change ensures that we deregister all of the jobs and insert all evals in a single transaction, thus blocking processing of the related evals until deregistration completes.

@notnoop notnoop requested a review from dadgar November 10, 2018 03:42
@notnoop
Contributor Author

notnoop commented Nov 10, 2018

I have tested this change with my reproduction setup and didn't notice a case of a job retrying even after 10 hours - previously a case would appear in ~5-10 minutes. So success?!

@dadgar
Contributor

dadgar commented Nov 12, 2018

LGTM after comments

@notnoop notnoop force-pushed the b-batch-deregister-transaction branch from a422d8c to 9405473 on November 12, 2018 21:09
@notnoop notnoop merged commit 91ea405 into master Nov 13, 2018
@endocrimes endocrimes deleted the b-batch-deregister-transaction branch November 13, 2018 04:55
@tantra35
Contributor

@notnoop after we backported this to 0.8.6, we began to see the following from time to time when we manually run GC via curl -XPUT http://localhost:4646/v1/system/gc:

Nov 18 19:29:38 consulnomad-2 nomad.sh[24477]:     2018/11/18 19:29:38 [DEBUG] memberlist: TCP connection from=127.0.0.1:41524
Nov 18 19:29:38 consulnomad-2 nomad.sh[24477]:     2018/11/18 19:29:38.900118 [DEBUG] worker: dequeued evaluation 42512a64-2cb9-0e2e-f8f7-9ee9428366be
Nov 18 19:29:38 consulnomad-2 nomad.sh[24477]:     2018/11/18 19:29:38.900525 [DEBUG] sched.core: forced job GC
Nov 18 19:29:38 consulnomad-2 nomad.sh[24477]:     2018/11/18 19:29:38.900720 [DEBUG] sched.core: job GC: 2 jobs, 3 evaluations, 2 allocs eligible
Nov 18 19:29:38 consulnomad-2 nomad.sh[24477]: panic: runtime error: invalid memory address or nil pointer dereference
Nov 18 19:29:38 consulnomad-2 nomad.sh[24477]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x38 pc=0x74ce52]
Nov 18 19:29:38 consulnomad-2 nomad.sh[24477]: goroutine 45 [running]:
Nov 18 19:29:38 consulnomad-2 nomad.sh[24477]: github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-immutable-radix.(*Iterator).Next(0xc420571d20, 0x1b190e0, 0xc420868d00, 0x1d6bce0, 0xc4205dd8e0, 0xc4205dd910, 0xc4205dd910)
Nov 18 19:29:38 consulnomad-2 nomad.sh[24477]:         /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-immutable-radix/iter.go:81 +0xc2
Nov 18 19:29:38 consulnomad-2 nomad.sh[24477]: github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-memdb.(*radixIterator).Next(0xc4205dd840, 0x1bee89a, 0xb)
Nov 18 19:29:38 consulnomad-2 nomad.sh[24477]:         /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-memdb/txn.go:634 +0x2e
Nov 18 19:29:38 consulnomad-2 nomad.sh[24477]: github.com/hashicorp/nomad/nomad/state.(*StateStore).deleteJobVersions(0xc420333a00, 0xe6, 0xc4204bd520, 0xc4205c7d40, 0xc420571ce0, 0x0)
Nov 18 19:29:38 consulnomad-2 nomad.sh[24477]:         /opt/gopath/src/github.com/hashicorp/nomad/nomad/state/state_store.go:1095 +0x1ae
Nov 18 19:29:38 consulnomad-2 nomad.sh[24477]: github.com/hashicorp/nomad/nomad/state.(*StateStore).DeleteJobTxn(0xc420333a00, 0xe6, 0xc42070b0b0, 0x7, 0xc42070b090, 0xd, 0xc4205c7d40, 0x1d640a0, 0xc4205dd770)
Nov 18 19:29:38 consulnomad-2 nomad.sh[24477]:         /opt/gopath/src/github.com/hashicorp/nomad/nomad/state/state_store.go:1067 +0x46f
Nov 18 19:29:38 consulnomad-2 nomad.sh[24477]: github.com/hashicorp/nomad/nomad.(*nomadFSM).handleJobDeregister(0xc4202ef6e0, 0xe6, 0xc42070b090, 0xd, 0xc42070b0b0, 0x7, 0x197d001, 0xc4205c7d40, 0x0, 0x0)
Nov 18 19:29:38 consulnomad-2 nomad.sh[24477]:         /opt/gopath/src/github.com/hashicorp/nomad/nomad/fsm.go:556 +0x1ab
Nov 18 19:29:38 consulnomad-2 nomad.sh[24477]: github.com/hashicorp/nomad/nomad.(*nomadFSM).applyBatchDeregisterJob.func1(0xc4205c7d40, 0x1c6c8a0, 0xc4205c7d40)
Nov 18 19:29:38 consulnomad-2 nomad.sh[24477]:         /opt/gopath/src/github.com/hashicorp/nomad/nomad/fsm.go:524 +0x12f
Nov 18 19:29:38 consulnomad-2 nomad.sh[24477]: github.com/hashicorp/nomad/nomad/state.(*StateStore).WithWriteTransaction(0xc420333a00, 0xc4207d9c58, 0x0, 0x0)
Nov 18 19:29:38 consulnomad-2 nomad.sh[24477]:         /opt/gopath/src/github.com/hashicorp/nomad/nomad/state/state_store.go:3936 +0x79
Nov 18 19:29:38 consulnomad-2 nomad.sh[24477]: github.com/hashicorp/nomad/nomad.(*nomadFSM).applyBatchDeregisterJob(0xc4202ef6e0, 0xc420728001, 0x42d, 0x4bf, 0xe6, 0x0, 0x0)
Nov 18 19:29:38 consulnomad-2 nomad.sh[24477]:         /opt/gopath/src/github.com/hashicorp/nomad/nomad/fsm.go:522 +0x1b1
Nov 18 19:29:38 consulnomad-2 nomad.sh[24477]: github.com/hashicorp/nomad/nomad.(*nomadFSM).Apply(0xc4202ef6e0, 0xc420cf8c20, 0x2ba2d40, 0x3)
Nov 18 19:29:38 consulnomad-2 nomad.sh[24477]:         /opt/gopath/src/github.com/hashicorp/nomad/nomad/fsm.go:242 +0xf81
Nov 18 19:29:38 consulnomad-2 nomad.sh[24477]: github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*Raft).runFSM.func1(0xc4205dd360)
Nov 18 19:29:38 consulnomad-2 nomad.sh[24477]:         /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/fsm.go:57 +0x15a
Nov 18 19:29:38 consulnomad-2 nomad.sh[24477]: github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*Raft).runFSM(0xc420536580)
Nov 18 19:29:38 consulnomad-2 nomad.sh[24477]:         /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/fsm.go:120 +0x2fa
Nov 18 19:29:38 consulnomad-2 nomad.sh[24477]: github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*Raft).(github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.runFSM)-fm()
Nov 18 19:29:38 consulnomad-2 nomad.sh[24477]:         /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/api.go:506 +0x2a
Nov 18 19:29:38 consulnomad-2 nomad.sh[24477]: github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*raftState).goFunc.func1(0xc420536580, 0xc420440100)
Nov 18 19:29:38 consulnomad-2 nomad.sh[24477]:         /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/state.go:146 +0x53
Nov 18 19:29:38 consulnomad-2 nomad.sh[24477]: created by github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*raftState).goFunc
Nov 18 19:29:38 consulnomad-2 nomad.sh[24477]:         /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/state.go:144 +0x66

After that the Nomad server cluster is fully damaged and stops functioning properly.

@tantra35
Contributor

On our test stand we launch the following jobs:

job "test"
{
        datacenters = ["test"]
        type = "batch"

        constraint
        {
                attribute = "${attr.kernel.name}"
                value = "linux"
        }

        constraint
        {
                attribute = "${node.unique.name}"
                operator = "="
                value = "dockerworker-1"
        }

        task "diamondbcapacitycollector"
        {
                driver = "exec"

                config
                {
                        command = "sleep"
                        args = ["600"]
                }

                logs
                {
                        max_files = 3
                        max_file_size = 10
                }

                resources
                {
                        cpu = 100
                        memory = 300
                }
        }
}

and

job "test_debug-00"
{
        datacenters = ["test"]
        priority = 50

        constraint
        {
                attribute = "${attr.kernel.name}"
                value = "linux"
        }

        constraint
        {
                distinct_hosts = true
        }

        update
        {
                stagger = "10s"
                max_parallel = 1
        }

        meta
        {
                CONTAINER_VERSION = "02"
        }

        group test
        {
                count = 2

                task "test_debug_task"
                {
                        driver = "raw_exec"
                        kill_timeout = "1m"

                        artifact
                        {
                                source = "http://docker.service.consul/playrix-vault_debug-${NOMAD_META_CONTAINER_VERSION}.tar.gz"

                                options
                                {
                                        archive=false
                                }
                        }

                        config
                        {
                                command = "bash"
                                args = ["-c", "sleep 10; exit 10"]
                        }

                        vault
                        {
                                policies = ["bind_password","vault_debug-consul"]
                                change_mode   = "noop"
                        }

                        template
                        {
                                data = <<EOH
                                CONSUL_TOKEN="{{with secret "secrets/consul/plr/global/creds/vault_debug-consul"}}{{.Data.token}}{{end}}"
EOH
                                destination = "secrets/consul_token.env"
                                env = true
                        }

                        logs
                        {
                                max_files = 3
                                max_file_size = 10
                        }

                        resources
                        {
                                memory = 100
                                cpu = 200
                        }
                }
        }
}

@tantra35
Contributor

Before the cluster crash, the state of the test job was:

vagrant@consulnomad-1:~/nomad$ nomad status test
ID            = test
Name          = test
Submit Date   = 2018-11-18T17:06:04+03:00
Type          = batch
Priority      = 50
Datacenters   = test
Status        = dead
Periodic      = false
Parameterized = false

Summary
Task Group                 Queued  Starting  Running  Failed  Complete  Lost
diamondbcapacitycollector  0       0         0        0       2         0

Allocations
ID        Node ID   Task Group                 Version  Desired  Status    Created    Modified
56bb5e80  a2455479  diamondbcapacitycollector  2        run      complete  2h22m ago  2h12m ago
4fb97421  a2455479  diamondbcapacitycollector  0        stop     complete  2h25m ago  2h24m ago
vagrant@consulnomad-1:~/nomad$ curl -XPUT http://localhost:4646/v1/system/gc
vagrant@consulnomad-1:~/nomad$ nomad status test
Error querying search with id: "Unexpected response code: 500 (rpc error: rpc error: failed to get conn: dial tcp 192.168.142.103:4647: connect: connection refused)"

@tantra35
Contributor

We are absolutely sure that we made the backport correctly.

@tantra35
Contributor

First of all, we compared our version with the 0.8.7 branch, and we can also provide our patch.

@tantra35
Contributor

Also, I must say that this is not 100% reproducible; we have caught it only two (2) times.

@notnoop
Contributor Author

notnoop commented Nov 18, 2018

@tantra35 Thank you so much for reporting the issue and providing a test case. I'll dig into this tomorrow (on Monday) and follow up.

@tantra35
Contributor

@notnoop thanks for the quick reaction. I must also say that I'm not sure the cause is exactly this PR, because our version also includes

#4764
#4839
#4842

backported, but when we tested them we didn't see any issues.

It seems that this happens when we wait for the batch job to end successfully and then launch it again, so our steps are as follows:

nomad run test.nomad
<wait when it stops normaly>
nomad run test.nomad
<wait some time>
curl -XPUT http://localhost:4646/v1/system/gc

@notnoop
Contributor Author

notnoop commented Nov 20, 2018

@tantra35 Thanks again for reporting the issue - I was able to reproduce the panic reliably and have a fix in #4903 - would you be able to test that PR in your environment by any chance?

@tantra35
Contributor

@notnoop We tested this patch and it seems everything works well. Now we can't launch the same job if its previous version is in the dead state; only after we run GC does Nomad allow us to launch a new one.

@preetapan
Contributor

@tantra35 re: "we can't launch the same job if its previous version is in the dead state; only after we run GC does Nomad allow us to launch a new one"

If a job is resubmitted with a change, it should run without needing a GC. If a batch job's allocs are all complete and you resubmit it with no changes, Nomad will not rerun the completed allocations.

@tantra35
Contributor

@preetapan before @notnoop made the final fixes, we had the ability to launch a batch job that hadn't changed, and you can see this in the node status provided earlier in this issue.

@notnoop
Contributor Author

notnoop commented Nov 21, 2018

That's interesting - I wasn't able to reproduce it with different Nomad releases. I captured my output for Nomad 0.8.2, for example: https://gist.github.com/notnoop/151105c99070d93333bed23aec7ce42c - you can see that resubmitting the same job doesn't trigger a new allocation (i.e. it doesn't run again) until the job is modified.

If you'd like to investigate this further, please open a new ticket with the version of Nomad you are using and the commands you ran, and we'll follow up there - I'm afraid of losing valuable insights and issues in PR comments.

Thanks!

@tantra35
Contributor

@notnoop this is 100% reproducible in the case I described in #4532; yes, there we manually stop the batch jobs. I must also mention that I again see that I can launch it:

vagrant@consulnomad-1:~$ nomad status e2914cd3
ID                  = e2914cd3
Eval ID             = f826017e
Name                = test.diamondbcapacitycollector[0]
Node ID             = aafdcb7a
Job ID              = test
Job Version         = 0
Client Status       = complete
Client Description  = <none>
Desired Status      = run
Desired Description = <none>
Created             = 11m40s ago
Modified            = 1m1s ago

Task "diamondbcapacitycollector" is "dead"
Task Resources
CPU      Memory   Disk     IOPS  Addresses
100 MHz  300 MiB  300 MiB  0

Task Events:
Started At     = 2018-11-24T11:37:18Z
Finished At    = 2018-11-24T11:47:18Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type        Description
2018-11-24T14:47:18+03:00  Terminated  Exit Code: 0
2018-11-24T14:37:18+03:00  Started     Task started by client
2018-11-24T14:36:39+03:00  Task Setup  Building Task Directory
2018-11-24T14:36:39+03:00  Received    Task received by client
vagrant@consulnomad-1:~$ nomad status test
ID            = test
Name          = test
Submit Date   = 2018-11-24T14:36:38+03:00
Type          = batch
Priority      = 50
Datacenters   = test
Status        = dead
Periodic      = false
Parameterized = false

Summary
Task Group                 Queued  Starting  Running  Failed  Complete  Lost
diamondbcapacitycollector  0       0         0        0       1         0

Allocations
ID        Node ID   Task Group                 Version  Desired  Status    Created     Modified
e2914cd3  aafdcb7a  diamondbcapacitycollector  0        run      complete  11m45s ago  1m6s ago
vagrant@consulnomad-1:~$ nomad run ./test.nomad
Error getting job struct: Error getting jobfile from "./test.nomad": source path error: stat /home/vagrant/test.nomad: no such file or directory
vagrant@consulnomad-1:~$ cd ./nomad/
vagrant@consulnomad-1:~/nomad$ nomad run ./test.nomad
==> Monitoring evaluation "db5d4074"
    Evaluation triggered by job "test"
    Allocation "f77d2310" created: node "aafdcb7a", group "diamondbcapacitycollector"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "db5d4074" finished with status "complete"
vagrant@consulnomad-1:~/nomad$ nomad status
ID    Type   Priority  Status   Submit Date
test  batch  50        running  2018-11-24T14:48:41+03:00
vagrant@consulnomad-1:~/nomad$ nomad status test
ID            = test
Name          = test
Submit Date   = 2018-11-24T14:48:41+03:00
Type          = batch
Priority      = 50
Datacenters   = test
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group                 Queued  Starting  Running  Failed  Complete  Lost
diamondbcapacitycollector  0       0         1        0       1         0

Allocations
ID        Node ID   Task Group                 Version  Desired  Status    Created     Modified
f77d2310  aafdcb7a  diamondbcapacitycollector  1        run      running   16s ago     10s ago
e2914cd3  aafdcb7a  diamondbcapacitycollector  0        run      complete  12m19s ago  1m40s ago

@tantra35
Contributor

Here we can see that the version of the allocation changes, but the actual job description does not.

@tantra35
Contributor

This happens randomly in our test Vagrant environment.

@github-actions

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 24, 2023