Improvements to Template + Vault during Nomad Client restarts #13313
Comments
Hi @chuckyz!
I suspect that what we're seeing here is the task failing to restore: the client's task runner hasn't successfully reattached to the task. Did this state continue after Vault connectivity was restored?

As for the template runner persisting state, this is all great and aligns with some ideas we've been discussing. The tricky thing with templating right now is that the template runner runs in-process with the Nomad client, but we're currently using …
Let me re-test; I think this might be the core of that particular angle. Looking at #12301, this would run consul-template and go-getter in the same way as Envoy, where it's run inside an allocation in a bridge-style mode, yes? Thinking about that, I think we'd still have the issue of a valid Vault token but the Vault fingerprint failing, so I'd really like some kind of knob exposed that says 'I don't care that Vault is down and the fingerprint is failing; just keep retrying forever, but leave the alloc in the state it's in now.' Ideally, if vault_retry has attempts = 0, this could be short-circuited to that behavior (sketched below).
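A minimal sketch of what that knob could look like in the agent configuration, assuming the client-level template block from 1.2.4 (the backoff values here are illustrative, not a proposal):

```hcl
client {
  template {
    # In consul-template semantics, attempts = 0 means retry indefinitely.
    # The ask: treat this as "never give up on Vault, and leave the
    # allocation running in its current state while retrying".
    vault_retry {
      attempts    = 0
      backoff     = "250ms"
      max_backoff = "1m"
    }
  }
}
```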
Yes, although probably not in the same network namespace as the rest of the allocation. The nitty-gritty details still need to be worked out.
Perfect!
First off, thank you so much for the template improvements in 1.2.4!!
We’ve implemented these in our testing environment, and I’d like to make a further improvement proposal. Today, when our config management (Chef) runs, we just hard-restart Nomad after each run. This has served us pretty well so far, but unfortunately it has exposed a flaw in Nomad’s template system, especially when combined with these improvements.
I recently simulated a Vault failure (pointing DNS at 127.0.0.1 by overriding /etc/resolv.conf), and everything behaved exactly as expected, right up until the client daemon restarted.
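For reproduction, the simulation amounted to something like the following; assuming nothing answers DNS on 127.0.0.1, the Vault hostname simply stops resolving:

```
# /etc/resolv.conf -- send all lookups to a local resolver that isn't
# running, so the Vault address fails to resolve (simulated outage)
nameserver 127.0.0.1
```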
Upon restarting the client, the following messages started appearing:
You can see from the timestamps here that vault_retry seems to be ignored. I believe this is acceptable, and even desirable, since this is a renewal of a lease outside the template stanza, purely within the Vault integration.
This did not cause the allocation to fail, but it put the allocation in a state I can’t really explain. It was running; the container was there, happily working and serving traffic. However, from the control plane it was completely broken: CPU stats were unreported, and it was as if the allocation existed but was ‘detached’, for lack of a better term.
When Nomad was restarted a second time with the allocation in this state, it marked the allocation as failed and removed it from the node. I don’t think this is wrong behavior, but it is undesirable for our use-cases.
Proposal
This is leading to the following asks:
Note: one extremely explicit call-out here: I do not expect things to live through host restarts or things like Docker restarts. If a host restarts or all containers stop, then all bets are off.
Use-cases
The purpose of those asks is to allow allocations to ‘survive’ through upstream problems and Nomad daemon restarts.
Attempted Solutions
Modifying all the *_retry settings (see the sketch below).
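For reference, the tuning attempted was along these lines, via the client-level template retry blocks (the values shown are illustrative, not the exact ones used):

```hcl
client {
  template {
    # Retry behavior for Consul lookups made by templates.
    consul_retry {
      attempts    = 0        # 0 = retry indefinitely
      backoff     = "250ms"
      max_backoff = "1m"
    }
    # Retry behavior for Vault reads made by templates.
    vault_retry {
      attempts    = 0
      backoff     = "250ms"
      max_backoff = "1m"
    }
  }
}
```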