Fail alloc if alloc runner prestart hooks fail #5905
Conversation
// MarkFailedDead marks a task as failed and not to run. It is intended to be
// invoked when alloc runner prestart hooks fail. It should never be called
// together with Run().
func (tr *TaskRunner) MarkFailedDead(reason string) {
Here, we introduce another function to call instead of Run(), breaking an invariant; but I had a very hard time rationalizing reusing Run() when we never want to call any of the logic there, and we would need to signal that the task should fail. I felt that MarkFailedDead is a reasonable compromise that does the bare minimum to mark the task as failed.
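To make that invariant concrete, here is a minimal, self-contained Go sketch (the taskRunner type and its fields are simplified stand-ins for illustration, not Nomad's actual TaskRunner): both MarkFailedDead and Run own the close of the runner's wait channel, so exactly one of them may ever run.

```go
package main

import (
	"fmt"
	"sync"
)

// TaskState mirrors the pending/dead lifecycle states discussed above.
type TaskState string

const (
	TaskStatePending TaskState = "pending"
	TaskStateDead    TaskState = "dead"
)

// taskRunner is a stripped-down stand-in for Nomad's TaskRunner: it owns a
// wait channel that consumers (GC, Shutdown) block on until the task ends.
type taskRunner struct {
	mu     sync.Mutex
	state  TaskState
	failed bool
	waitCh chan struct{}
}

func newTaskRunner() *taskRunner {
	return &taskRunner{state: TaskStatePending, waitCh: make(chan struct{})}
}

// MarkFailedDead moves the task straight to a failed, dead state and closes
// waitCh so anything blocked on WaitCh unblocks. It must be called *instead
// of* Run, never alongside it: both close waitCh, and a second close panics.
func (tr *taskRunner) MarkFailedDead(reason string) {
	tr.mu.Lock()
	tr.state = TaskStateDead
	tr.failed = true
	tr.mu.Unlock()
	fmt.Println("task failed:", reason)
	close(tr.waitCh)
}

// Run is the normal lifecycle path; it also closes waitCh on exit.
func (tr *taskRunner) Run() {
	defer close(tr.waitCh)
	// ... start the task and wait for it to finish ...
}

// WaitCh is what GC and Shutdown block on. Before this fix, it never closed
// when alloc runner prestart hooks failed, because Run was never invoked.
func (tr *taskRunner) WaitCh() <-chan struct{} { return tr.waitCh }

func main() {
	tr := newTaskRunner()
	tr.MarkFailedDead("alloc runner prestart hooks failed")
	<-tr.WaitCh() // returns immediately instead of hanging forever
	fmt.Println("state:", tr.state, "failed:", tr.failed)
}
```

Because both paths own the close of waitCh, calling MarkFailedDead and Run on the same runner would panic on the double close, which is exactly why the comment warns it should never be called together with Run().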
Great job tying a lot of loose threads together, and the test looks great. I'm all for getting this merged ASAP, as it closes a huge gap in AR's lifecycle management, but there's one case that still concerns me: on client agent restart, if AR.prerun() errors, tasks may be leaked along with logmon and consul services.

Two solutions come to mind, but I'm sure there are more approaches:

- Go back to always calling TR.Run() and rely on a dead/terminal-esque check to immediately try to clean up any existing driver handle and run stop hooks before exiting.
- Add a TR.Destroy method, called from AR.Destroy, that does a final cleanup pass (stop hooks and driver handles).

Option 1 is closer to the 0.9 status quo, but seems fragile, confusing, and possibly tricky to implement right. Option 2 is closer to the 0.8 status quo IIRC, and I think it was dropped in favor of ~always calling TR.Run and using stop hooks.

TR.Destroy would be a best-effort cleanup: run stop hooks, call DestroyTask, and ignore any errors, since in the common case all of those operations would have already been done. This makes testing it a little annoying, as it probably should never return an error, but we can either have Destroy() wrap a destroyImpl() that returns an error and test the impl, or just assert that the expected actions took place.
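A minimal sketch of that Destroy/destroyImpl split, assuming hypothetical runStopHooks and destroyTaskHandle helpers as stand-ins for the stop hooks and the driver's DestroyTask call:

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// taskRunner is a minimal stand-in; runStopHooks and destroyTaskHandle are
// hypothetical placeholders for Nomad's stop hooks and driver cleanup.
type taskRunner struct {
	handleDestroyed bool
}

func (tr *taskRunner) runStopHooks() error {
	// e.g. deregister consul services, stop logmon
	return nil
}

func (tr *taskRunner) destroyTaskHandle() error {
	if tr.handleDestroyed {
		return nil // already cleaned up on the happy path
	}
	tr.handleDestroyed = true
	return nil
}

// destroyImpl does the real work and returns an error so unit tests can
// exercise and assert on the failure paths directly.
func (tr *taskRunner) destroyImpl() error {
	return errors.Join(tr.runStopHooks(), tr.destroyTaskHandle())
}

// Destroy is the exported best-effort wrapper: errors are logged, never
// returned, since in the common case all of this has already happened.
func (tr *taskRunner) Destroy() {
	if err := tr.destroyImpl(); err != nil {
		log.Printf("best-effort cleanup error (ignored): %v", err)
	}
}

func main() {
	tr := &taskRunner{}
	tr.Destroy()
	fmt.Println("handle destroyed:", tr.handleDestroyed)
}
```

Splitting the error-returning impl from the error-swallowing wrapper keeps the public contract ("Destroy never fails") while leaving the cleanup logic testable.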
When an alloc runner prestart hook fails, the task runners aren't invoked and they remain in a pending state. This leads to terrible results, some of which are:

- Lockup in the GC process, as reported in #5861
- Lockup in the shutdown process, as TR.Shutdown() waits for WaitCh to be closed
- Alloc not being restarted/rescheduled to another node (as it's still in pending state)
- Unexpected restart of the alloc on a client restart, potentially days/weeks after the alloc's expected start time!

Here, we treat all tasks as failed if an alloc runner prestart hook fails. This fixes the lockups and permits the alloc to be rescheduled on another node. While it's desirable to retry the alloc runner on such failures, I opted to treat that as out of scope: I'm afraid of some subtleties about alloc and task runners and their idempotency that are better handled in a follow-up PR.

This might be one of the root causes for #5840.
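A minimal sketch of the shape of this fix, using simplified stand-ins for the alloc and task runners; prerun here is a hypothetical placeholder for the real prestart hook chain, hard-wired to fail for illustration:

```go
package main

import "fmt"

// taskRunner is a minimal stand-in for Nomad's TaskRunner.
type taskRunner struct {
	waitCh chan struct{}
	failed bool
	reason string
}

func (tr *taskRunner) Run() {
	defer close(tr.waitCh)
	// ... normal task lifecycle ...
}

// MarkFailedDead unblocks WaitCh waiters without ever running the task.
func (tr *taskRunner) MarkFailedDead(reason string) {
	tr.failed = true
	tr.reason = reason
	close(tr.waitCh)
}

type allocRunner struct {
	tasks []*taskRunner
}

// prerun simulates the alloc runner prestart hook chain failing.
func (ar *allocRunner) prerun() error {
	return fmt.Errorf("prestart hook failed")
}

// Run shows the shape of the fix: if prerun errors, every task runner is
// marked failed and dead instead of being left pending forever.
func (ar *allocRunner) Run() {
	if err := ar.prerun(); err != nil {
		for _, tr := range ar.tasks {
			tr.MarkFailedDead(fmt.Sprintf("alloc runner prestart failed: %v", err))
		}
		return
	}
	for _, tr := range ar.tasks {
		go tr.Run()
	}
}

func main() {
	tr := &taskRunner{waitCh: make(chan struct{})}
	ar := &allocRunner{tasks: []*taskRunner{tr}}
	ar.Run()
	<-tr.waitCh // closed by MarkFailedDead, so GC/shutdown no longer hang
	fmt.Println("task failed:", tr.failed, "reason:", tr.reason)
}
```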
Handle prestart failure while restoring a task, to prevent accidentally leaking consul/logmon processes.
(force-pushed from 0398196 to 9980239)
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.