Improve batch job and restart behavior #30
Merged
The motivation for this was starting to use Nomad batch jobs. Using
these depends on being able to get an exit code from the container,
which the driver didn't support so far. The changes to pot needed to
make this work are in bsdpot/pot#200.
When a batch job returns an exit code other than 0, the restart
behavior from Nomad's restart stanza is applied. Restarts happen in
the context of the same allocation; this is also true for jobs
of type `service`. As the pot Nomad driver based the `potName` on
this data, the container name was recycled. This resulted in all
kinds of problems when restarting tasks rapidly.
To correct this, I changed the naming of pots from
to
Having `invocationId` in there makes sure that each container
name is actually unique. For the sake of not changing pot
in this respect, `invocationId + "_" + allocId` is passed in as
`allocId` when calling `pot prepare`. The resulting pot names
look quite okay (the way I structured jobs, they actually look
better/are easier on the eyes, but that's subjective).
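As a minimal sketch of the ID composition described above (not the driver's actual code): the function name `prepareAllocID` and the sample ID values are hypothetical, only the `invocationId + "_" + allocId` concatenation comes from this description.

```go
package main

import "fmt"

// prepareAllocID builds the value that is handed to `pot prepare` in place
// of the plain allocation ID. The function name is hypothetical; the
// concatenation itself follows the PR description.
func prepareAllocID(invocationID, allocID string) string {
	return invocationID + "_" + allocID
}

func main() {
	// Hypothetical values: a restart keeps the allocation ID but is
	// assumed to get a fresh invocation ID.
	allocID := "alloc-1234"
	first := prepareAllocID("inv-a", allocID)
	second := prepareAllocID("inv-b", allocID)

	fmt.Println(first)
	fmt.Println(second)
	fmt.Println("names differ across restarts:", first != second)
}
```

Because the allocation stays the same across restarts, the invocation part is what keeps two consecutive runs from sharing a container name.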
Retrieving the exit code makes use of a new pot feature from the
review mentioned above. This is a two-level process:
Check whether `pot start` returned a distinct error code and, if it
did, use `pot last-run-stats` to retrieve the process' exit code.
Always use `pot last-run-stats` to retrieve the process' exit code.
In both cases, the pot container is destroyed immediately once
finished to avoid piling up stale pots that would need to be garbage
collected with `pot prune` (which can get quite expensive). In the
future, a parameter could be added to make this behavior
configurable.
I hope I didn't miss any potential code paths (batch jobs rely on
getting reliable results from the driver).
Example batch job definition: