Improve batch job and restart behavior (#30)
The motivation for this was starting to use Nomad batch jobs. These depend on being able to get an exit code from the container, which the driver did not support so far. The changes to pot that make this work are in bsdpot/pot#200.

When a batch job returns an exit code != 0, the restart behavior from Nomad's restart stanza is applied. Restarts happen in the context of the same allocation; this is also true for jobs of type `service`. As the pot nomad driver based the potName on this data, the container name would be recycled. This resulted in all kinds of problems when restarting tasks rapidly.

To correct this, I changed the naming of pots from

    jobname + taskname + "_" + allocId

to

    taskname + "_" + invocationId + "_" + allocId

Having invocationId in there makes sure that each container name is actually unique (for the sake of not changing pot in this respect, `invocationId + "_" + allocId` is passed in as allocId when calling `pot prepare`). The resulting pot names look quite okay (the way I structured jobs, they actually look better and are easier on the eyes, but that's subjective).

Retrieving the exit code makes use of a new pot feature in the review mentioned above. This is a two-level process:

1. If in potWait (no Nomad restart happened): check if `pot start` returned a distinct error code and, if it did, use `pot last-run-stats` to retrieve the process' exit code.
2. If in recoverWait (Nomad restart happened): always use `pot last-run-stats` to retrieve the process' exit code.

In both cases, the pot container is destroyed immediately once finished, to avoid piling up stale pots that would need to be garbage collected with `pot prune` (which can get quite expensive). In the future, a parameter could be added to make this behavior configurable.

I hope I didn't miss any potential code paths (batch jobs rely on getting reliable results from the driver).
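To illustrate the renaming, here is a minimal Go sketch of the two schemes. It is not the driver's actual code: the function names (`oldPotName`, `newPotName`, `prepareAllocID`) and the sample IDs are made up for illustration; only the concatenation order comes from the description above.

```go
package main

import "fmt"

// oldPotName follows the previous scheme: jobname + taskname + "_" + allocId.
// Restarts reuse the allocation, so this name stays the same across restarts.
func oldPotName(jobName, taskName, allocID string) string {
	return jobName + taskName + "_" + allocID
}

// prepareAllocID builds the value that is passed as allocId to `pot prepare`:
// invocationId + "_" + allocId.
func prepareAllocID(invocationID, allocID string) string {
	return invocationID + "_" + allocID
}

// newPotName follows the new scheme:
// taskname + "_" + invocationId + "_" + allocId.
func newPotName(taskName, invocationID, allocID string) string {
	return taskName + "_" + prepareAllocID(invocationID, allocID)
}

func main() {
	// Hypothetical IDs; real values come from Nomad.
	allocID := "alloc1234"
	for _, invocationID := range []string{"inv1", "inv2"} {
		fmt.Println("old:", oldPotName("cmd", "command", allocID))
		fmt.Println("new:", newPotName("command", invocationID, allocID))
	}
}
```

With the old scheme both iterations yield the same name, while the new scheme yields a different name per invocation, which is the property the change relies on.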
Example batch job definition:

    job "cmd" {
      datacenters = ["dc1"]
      type = "batch"

      group "cmd-group" {
        task "command" {
          driver = "pot"

          restart {
            # aggressive
            interval = "30m"
            attempts = 200
            delay = "0s"
            mode = "fail"
          }

          config {
            image = "https://pottery.example.org"
            pot = "command_13_0"
            tag = "0.1"
            command = "/bin/sh"
            args = ["-c", "'date; false'"]
          }
        }
      }
    }
Showing 4 changed files with 116 additions and 20 deletions.