Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Improve batch job and restart behavior #30

Merged
merged 1 commit into from
Jul 6, 2022

Conversation

grembo
Copy link
Contributor

@grembo grembo commented Jan 30, 2022

The motivation for this was starting to use nomad batch jobs. Using
these depends on being able to get an exit code from the container,
which the driver didn't support so far. Changes to pot to make
this work are in bsdpot/pot#200 .

When a batch job returns an error code != 0, the restart behavior
in Nomad's restart stanza is applied. Restarts happen in the
context of the same allocation - this is also true for jobs
of type service. As the pot nomad driver would base the potName
on this data, the container name would be recycled. This
resulted in all kinds of problems when restarting tasks
rapidly.

To correct this, I changed the naming of pots from

jobname + taskname + "_" + allocId

to

taskname + "_" + invocationId + "_" + allocId

Having invocationId in there makes sure that each container
name is actually complete (for the sake of not changing pot
in this respect, invocationId + "_" + allocId are passed in as
allocId when calling pot prepare. The resulting pot names
look quite okay (the way I structured jobs, they actually look
better/are easier on the eyes, but that's subjective).

Retrieving the exit code makes use of a new pot feature in the
review mentioned above. This is a two level process:

  1. If in potWait (no Nomad restart happened):
    Check if pot start returned a distinct error code and if it did,
    use pot last-run-stats to retrieve the process' exit code.
  2. If in recoverWait (Nomad restart happened):
    Always use pot last-run-stats to retrieve the process' exit code.

In both cases, the pot container is destroyed immediately once finished
to avoid piling up stale pots that would need to be garbage collected
with pot prune (which can get quite expensive). In the future,
a parameter could be added to allow to configure this behavior.

I hope I didn't miss any potential code paths (batch jobs rely on
getting reliable results from the driver).

Example batch job definition:

job "cmd" {
  datacenters = ["dc1"]
  type = "batch"

  group "cmd-group" {
    task "command" {
      driver = "pot"

      restart { # agressive
        interval = "30m"
        attempts = 200
        delay    = "0s"
        mode     = "fail"
      }

      config {
        image = "https://pottery.example.org"
        pot = "command_13_0"
        tag = "0.1"
        command = "/bin/sh"
        args = ["-c", "'date; false'"]
      }
    }
  }
}

The motivation for this was starting to use nomad batch jobs. Using
these depends on being able to get an exit code from the container,
which the driver didn't support so far. Changes to pot to make
this work are in bsdpot/pot#200 .

When a batch job returns an error code != 0, the restart behavior
in Nomad's restart stanza is applied. Restarts happen in the
context of the same allocation - this is also true for jobs
of type `service`. As the pot nomad driver would base the potName
on this data, the container name would be recycled. This
resulted in all kinds of problems when restarting tasks
rapidly.

To correct this, I changed the naming of pots from

    jobname + taskname + "_" + allocId

to

    taskname + "_" + invocationId + "_" + allocId

Having invocationId in there makes sure that each container
name is actually complete (for the sake of not changing pot
in this respect, `invocationId + "_" + allocId` are passed in as
allocId when calling `pot prepare`. The resulting pot names
look quite okay (the way I structured jobs, they actually look
better/are easier on the eyes, but that's subjective).

Retrieving the exit code makes use of a new pot feature in the
review mentioned above. This is a two level process:

1. If in potWait (no Nomad restart happened):
   Check if `pot start` returned a distinct error code and if it did,
   use `pot last-run-stats` to retrieve the process' exit code.
2. If in recoverWait (Nomad restart happened):
   Always use `pot last-run-stats` to retrieve the process' exit code.

In both cases, the pot container is destroyed immediately once finished
to avoid piling up stale pots that would need to be garbage collected
with `pot prune` (which can get quite expensive). In the future,
a parameter could be added to allow to configure this behavior.

I hope I didn't miss any potential code paths (batch jobs rely on
getting reliable results from the driver).

Example batch job definition:

    job "cmd" {
      datacenters = ["dc1"]
      type = "batch"

      group "cmd-group" {
        task "command" {
          driver = "pot"

          restart { # agressive
            interval = "30m"
            attempts = 200
            delay    = "0s"
            mode     = "fail"
          }

          config {
            image = "https://pottery.example.org"
            pot = "command_13_0"
            tag = "0.1"
            command = "/bin/sh"
            args = ["-c", "'date; false'"]
          }
        }
      }
    }
@grembo
Copy link
Contributor Author

grembo commented Feb 11, 2022

@pizzamig Could you help with this one?

@ebarriosjr
Copy link
Contributor

Hi @grembo thanks for this! IMHO it looks good. Will reach out to @pizzamig to get write permissions to this repo now that it moved here and then i will merge this. Sorry for the long wait.

@pizzamig
Copy link
Contributor

@ebarriosjr Please check now. Every member of the organization should have write permissions to all repos.

@grembo
Copy link
Contributor Author

grembo commented Jul 6, 2022

@ebarriosjr Should I proceed and just merge it myself? Thanks!

@ebarriosjr ebarriosjr merged commit fa96104 into bsdpot:master Jul 6, 2022
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants