Injector failure mode prevents Pod deletion #500

cjyar · 2023-07-17T21:08:36Z

Describe the bug
If a Pod has the agent-inject annotation yet gets created without the injected sidecars, then any future update to the Pod will trigger the injector to add the sidecars. If the Pod has already been created, these attempts to modify spec.containers or spec.initContainers will fail, thus causing the Pod update to fail. If the Pod has a finalizer, it will be impossible to remove the finalizer, and therefore it will be impossible to remove the Pod.

It's easy to enter this failure mode if there's a temporary connectivity error between the Kubernetes apiserver and the vault agent injector.

To Reproduce
Steps to reproduce the behavior:

1. Deploy vault-agent-injector. Use the default settings, which result in the MutatingWebhookConfiguration having webhooks[].failurePolicy: Ignore and webhooks[].timeoutSeconds: 30.
2. Force a vault-agent-injector failure by running kubectl scale deploy --replicas=0.
3. Create a Job with the agent-inject annotation. Configure the Job's entrypoint to wait a while, e.g. sleep 300.
4. After the Pod is created, it should take 30 seconds before it starts running, due to the MutatingWebhookConfiguration timeout.
5. After the Pod is running, but before it exits, restart vault-agent-injector with kubectl scale deploy --replicas=2.
6. After the Pod exits, it will fail to be deleted.
   - The Kubernetes controller-manager (job-controller) will try to remove the batch.kubernetes.io/job-tracking finalizer, but it will fail with Pod "pod-name" is invalid: spec.initContainers: Forbidden: pod updates may not add or remove containers.
   - Attempts to delete the Pod will be accepted, but the Pod won't actually be removed because the finalizer is still present.
   - Attempts to manually delete the finalizer using kubectl edit pod will fail with the same error.
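
To make the sequence above easier to follow, here is a toy model of it in Python. It is only a sketch: toy_injector and toy_validate_update are hypothetical stand-ins for the injector webhook and the apiserver's Pod update validation, not real code from either project, and the container name vault-agent-init is used purely for illustration.

```python
import copy

INJECT_ANNOTATION = "vault.hashicorp.com/agent-inject"
JOB_FINALIZER = "batch.kubernetes.io/job-tracking"


def toy_injector(pod):
    """Hypothetical webhook: add an init container if annotated and not yet injected."""
    pod = copy.deepcopy(pod)
    annotations = pod["metadata"].get("annotations", {})
    names = [c["name"] for c in pod["spec"].get("initContainers", [])]
    if annotations.get(INJECT_ANNOTATION) == "true" and "vault-agent-init" not in names:
        pod["spec"].setdefault("initContainers", []).append({"name": "vault-agent-init"})
    return pod


def toy_validate_update(old, new):
    """Hypothetical apiserver check: an update may not add or remove containers."""
    for field in ("initContainers", "containers"):
        old_names = [c["name"] for c in old["spec"].get(field, [])]
        new_names = [c["name"] for c in new["spec"].get(field, [])]
        if old_names != new_names:
            return f"spec.{field}: Forbidden: pod updates may not add or remove containers"
    return None


def try_remove_finalizer(stored_pod, finalizer):
    """Simulate the update a controller sends to drop its finalizer."""
    desired = copy.deepcopy(stored_pod)
    desired["metadata"]["finalizers"] = [
        f for f in desired["metadata"].get("finalizers", []) if f != finalizer
    ]
    mutated = toy_injector(desired)          # the webhook runs on the UPDATE
    error = toy_validate_update(stored_pod, mutated)
    if error:
        return stored_pod, error             # update rejected, finalizer stays
    return mutated, None


# A Pod that was created while the injector was down: annotated, but no sidecars.
pod = {
    "metadata": {
        "annotations": {INJECT_ANNOTATION: "true"},
        "finalizers": [JOB_FINALIZER],
    },
    "spec": {"containers": [{"name": "main"}]},
}
_, err = try_remove_finalizer(pod, JOB_FINALIZER)
print(err)
```

In this model the finalizer removal is rejected because the webhook tries to add a container to the update, which matches the error reported above.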

Expected behavior
The vault-agent-injector shouldn't block Pods from being finalized.

Environment

Kubernetes version:
- Distribution or cloud vendor (OpenShift, EKS, GKE, AKS, etc.): GKE v1.25.10-gke.1200
- Other configuration options or runtime services (istio, etc.): none
vault-k8s version: 1.2.1

Workaround
To delete a Pod that's stuck in this state, use kubectl edit pod to delete the vault.hashicorp.com/agent-inject annotation.
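
For anyone who would rather script this than use kubectl edit, the same change can be expressed as a JSON Patch. The sketch below only builds the patch document; the helper names are made up for illustration, and it has not been tested against a live cluster. The one non-obvious detail is that the "/" in the annotation key has to be escaped as "~1" in the JSON Pointer path (RFC 6901).

```python
import json

ANNOTATION = "vault.hashicorp.com/agent-inject"


def escape_json_pointer(token):
    """Escape one JSON Pointer token per RFC 6901: '~' -> '~0', then '/' -> '~1'."""
    return token.replace("~", "~0").replace("/", "~1")


def remove_annotation_patch(annotation=ANNOTATION):
    """Build a JSON Patch that removes a single annotation from an object."""
    path = "/metadata/annotations/" + escape_json_pointer(annotation)
    return [{"op": "remove", "path": path}]


print(json.dumps(remove_annotation_patch()))
```

The printed document is the kind of payload that kubectl patch accepts with --type=json.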

Fix
Even though this behavior is surprising, and the Kubernetes error message isn't super helpful, I think Kubernetes is actually doing the right thing. I think the injector could be modified to address this problem, though. If it's being asked to mutate a Pod, and the Pod's status.phase is a string other than Pending, then it should do nothing. In other words, if a Pod has already been created, the injector shouldn't try to add containers because the Pod's spec.initContainers and spec.containers are immutable.
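
As a rough illustration of the check proposed here, in Python rather than the injector's own code: should_inject_by_phase is a hypothetical name, and the Pod is a plain dict. Whether status.phase is already populated when the webhook sees a brand-new Pod isn't something I've verified, so the sketch assumes a missing phase should be treated like Pending.

```python
def should_inject_by_phase(pod):
    """Hypothetical guard: only inject while the Pod has not left the Pending phase.

    Assumption: a missing or empty phase means the Pod is still being created.
    """
    phase = pod.get("status", {}).get("phase", "")
    return phase in ("", "Pending")


print(should_inject_by_phase({"status": {"phase": "Running"}}))    # False
print(should_inject_by_phase({"status": {"phase": "Succeeded"}}))  # False
print(should_inject_by_phase({"status": {"phase": "Pending"}}))    # True
```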


komapa · 2023-08-22T14:54:54Z

Hello, we were wondering over here why even watch for UPDATE events in the operator? What is it accounting for? Thanks!

cjyar · 2023-08-23T17:04:58Z

@komapa You're probably right that only watching for CREATE events would be a cleaner fix for this problem. It would be nice to hear from the vault-k8s maintainers.

alculquicondor · 2023-11-03T19:40:08Z

I don't think the fix in #501 is bulletproof.

Let's assume the pod from step 3 fails to schedule. The user realizes this and attempts to delete the Pod (or Job).

At this point, the Pod still has phase: Pending, but its containers also can't be mutated.

The more appropriate fix is to simply NOT add a container if the webhook is called for an UPDATE operation.

/reopen
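
A minimal sketch of the operation-based check described in this comment, written in Python for illustration only. The request is modelled as a plain dict shaped like the request part of an AdmissionReview, whose operation field is one of CREATE, UPDATE, DELETE or CONNECT; should_inject is a hypothetical name, not a function from vault-k8s.

```python
def should_inject(admission_request):
    """Hypothetical guard: only add containers when the Pod is being created."""
    return admission_request.get("operation") == "CREATE"


# The unschedulable Pod from the scenario above: still Pending, but the
# request is an UPDATE, so its containers can no longer be changed.
stuck_update = {
    "operation": "UPDATE",
    "object": {"status": {"phase": "Pending"}},
}
new_pod = {"operation": "CREATE", "object": {"status": {}}}

print(should_inject(stuck_update))  # False
print(should_inject(new_pod))       # True
```

Because the check ignores status.phase entirely, it covers the Pending-but-already-created case that a phase-based check lets through.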

tomhjp · 2024-05-10T10:46:36Z

Sorry for letting this linger, and thanks for the reports. I'm going to work on getting hashicorp/vault-helm#783 merged which will fix this properly as per the above comments.

tomhjp · 2024-05-10T10:47:51Z

#619 is also related for anyone using the deployment yaml instead of the helm chart.

tomhjp · 2024-05-10T11:26:06Z

hashicorp/vault-helm#783 is merged; hopefully this fix will stick with the next release of the helm chart.

cjyar added the bug Something isn't working label Jul 17, 2023

cjyar mentioned this issue Jul 17, 2023

Only inject Pods that are Pending. #501

Merged

tomhjp closed this as completed in #501 Aug 16, 2023

tomhjp reopened this May 10, 2024

tomhjp closed this as completed May 10, 2024
