HNC: ValidatingWebhook should have a way to avoid operating on specific namespaces and/or object kinds #1023
Let me look into that ASAP. In the meantime, can you describe the
circumstances when the webhook became unresponsive? Was HNC under heavy
load at the time, or was it due to some kind of outage (e.g. the pod being
rescheduled onto another node)?
…On Tue, Aug 18, 2020 at 12:11 AM Taichi Nakashima ***@***.***> wrote:
We've been using HNC in our experimental GKE environment. But because the HNC
validation webhook objects.hnc.x-k8s.io was often unresponsive, various
system operations such as leader elections were failing, and this made the
cluster itself unstable...
I think the unresponsiveness itself is an issue (this time, since it's an
experimental environment, we don't have enough monitoring or clues about the
reason for the unresponsiveness ... we will report the issue we faced once we
find the root cause) but, at the same time, as the official documentation
says
<https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/#avoiding-operating-on-the-kube-system-namespace>,
we should avoid running the validation webhook on kube-system-related
operations (or other critical components), and this is what caused the
problems we faced.
To skip the validation webhook, we can use namespaceSelector, but to use it
we need a label. Since the manifest itself is free to modify, we can
introduce our own label, but I think HNC itself could provide a label for
this? For example, how about introducing admission.hnc.x-k8s.io/ignore like
the gatekeeper project does
<https://github.com/open-policy-agent/gatekeeper#exempting-namespaces-from-the-gatekeeper-admission-webhook-using---exempt-namespace-flag>?
What do you think?
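For illustration, a minimal sketch of how the proposed exemption could work, assuming the suggested admission.hnc.x-k8s.io/ignore label were honored through a namespaceSelector. The configuration name is illustrative and only the relevant fields are shown:

```yaml
# Sketch only: assumes the proposed admission.hnc.x-k8s.io/ignore label.
# Namespaces carrying the label would be skipped by the object webhook;
# the same mechanism could exempt kube-system or HNC's own namespace.
apiVersion: admissionregistration.k8s.io/v1beta1
kind: ValidatingWebhookConfiguration
metadata:
  name: hnc-validating-webhook-configuration  # illustrative name
webhooks:
  - name: objects.hnc.x-k8s.io
    namespaceSelector:
      matchExpressions:
        - key: admission.hnc.x-k8s.io/ignore
          operator: DoesNotExist  # the webhook only runs in namespaces without the label
    # clientConfig, rules, etc. as in the shipped manifest
```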
It was running on the experimental cluster (version 1.15.12-gke.9), which does not run many workloads (around 30 namespaces). Since it's an experimental cluster, we don't receive any production traffic, and we don't often touch it (only when we try something new). I deployed HNC exactly the way the user guide describes (no manifest modifications). Actually, we were notified of the cluster instability by GCP support (as I described above, we don't have any monitoring on this cluster). To recover, I just uninstalled HNC, so I don't have much information about what happened ... I tried searching the logs a bit but I could not find any clue about this unresponsive state. I will try installing it again later and see if this happens again. Then I can tell you more.
Thanks! One more question - are you using the OSS version of HNC? If so, you can modify the validating webhook configuration to have a very short timeout (e.g. 2s) so that if it goes down, objects will still be allowed through quickly. We should probably have done that already, but I didn't realize how long the default timeout was (it's 30s, the default for the webhook API version we're using).

If you're using the ConfigSync/ACM version, you won't be able to modify the webhook config directly, but since this is your test cluster, perhaps you could install the OSS version and see if the shorter timeout helps. If so, we'll get that into the next ConfigSync version.
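A sketch of that change, assuming the OSS manifest can be edited in place. Only the relevant fields are shown; failurePolicy is spelled out for clarity, and Ignore is already the v1beta1 default:

```yaml
# Sketch: shorten the object webhook's timeout so a hung webhook can't
# stall unrelated cluster operations for the full default timeout.
apiVersion: admissionregistration.k8s.io/v1beta1
kind: ValidatingWebhookConfiguration
metadata:
  name: hnc-validating-webhook-configuration  # illustrative name
webhooks:
  - name: objects.hnc.x-k8s.io
    timeoutSeconds: 2      # v1beta1 defaults to 30s
    failurePolicy: Ignore  # fail open: admit requests if the webhook is down
    # clientConfig, rules, etc. unchanged
```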
It's worth pointing out that the object webhooks are mainly there for safety, not correctness. If they don't run at all, the object reconciler will still try to undo any changes to the cluster as quickly as it can. This won't defend against an attack, but it will be enough to stop a mistake. So having a short timeout doesn't substantially increase the risk to your cluster.
Yes, we are using the OSS version of HNC.
Yeah, we'll configure it that way next time.
Ok, but I still can't understand why HNC needs to check all objects applied to the cluster? My understanding of HNC is that it only handles its own CRDs.
That's a good question. The short answer is that we could change HNC so that it only has a webhook for the types of objects it's tracking. The longer answer, and the reason we haven't done this yet, is that the list of object types can change dynamically, and it was far easier to write a single static webhook configuration than to automatically generate all possible configurations at runtime. Our belief (which turned out to be incorrect) was that the webhook could quickly ignore all object types we didn't care about. As a short-term fix, I think we'll change the timeout to 2s in a patch release (e.g. 0.5.2). Then there are two medium-term directions we can take:
1. Exclude critical namespaces such as kube-system from the webhook entirely (e.g. via a label and a namespaceSelector).
2. Dynamically generate the webhook configuration so it only covers the object types HNC actually propagates.
Ideally we'd have both, and we can probably get both done in the next few months. Do you have a preference for which we should do first?
From a priority point of view, I think 1 is more important than 2, because it's more important to avoid blocking kube-system's critical operations. And, at the same time, I think it's also important for HNC itself. For example, if we don't exclude the HNC namespace and HNC is not deployed correctly, a deadlock may happen (the webhook is asked to validate the HNC deployment, but HNC itself isn't working; we faced this issue with a different controller before).
Does this mean HNC itself will put the label on the namespace? I'm not sure whether HNC should have that responsibility, but it would be helpful.
I don't think we need a dynamic webhook configuration? What I thought is that instead of writing "*", we can list only the supported resources.
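A sketch of what that could look like; the resource list below is purely illustrative (examples of commonly propagated types), not HNC's actual set:

```yaml
# Sketch: a static rules list instead of resources: ["*"]. The types
# listed are examples only, not HNC's real configuration.
apiVersion: admissionregistration.k8s.io/v1beta1
kind: ValidatingWebhookConfiguration
metadata:
  name: hnc-validating-webhook-configuration  # illustrative name
webhooks:
  - name: objects.hnc.x-k8s.io
    rules:
      - apiGroups: ["", "rbac.authorization.k8s.io"]
        apiVersions: ["*"]
        operations: ["CREATE", "UPDATE", "DELETE"]
        resources: ["secrets", "configmaps", "roles", "rolebindings"]
```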
Agreed on all counts, thanks.
I'd prefer a K8s-defined label like
The problem is that users can add arbitrary types.
Ah, ok, now I understand. I thought the supported resources were limited. Thank you for the clarification.
If you're fine with it, I would like to contribute this change.
That would be great, thanks! You can submit the patch to the hnc-v0.5
branch as well as master, and then I'll release 0.5.2 next week.
…On Thu, Aug 20, 2020 at 9:15 PM Taichi Nakashima ***@***.***> wrote:
As a short-term fix, I think we'll change the timeout to 2s in a patch
release
If you are fine, I would like to contribute to this change.
Sent it 👉 #1039
I've been struggling to reproduce this. Simply killing the deployment or the webhook service isn't enough - in those cases, the webhook fails immediately and everything works just fine. I had to add a …
Looks like this will (eventually) be fixed in core K8s. For now I'm going to put this on the backlog; I don't think there's an urgent issue given the changes we've made so far.
Issues go stale after 90d of inactivity. If this issue is safe to close now, please do so with /close. Send feedback to sig-contributor-experience at kubernetes/community.
/remove-lifecycle stale
Since this was filed by @tcnksm, I think it would be nice for someone to have a go at this. /good-first-issue
@adrianludwin: Please ensure the request meets the requirements listed here. If this request no longer meets these requirements, the label can be removed.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Fixed by #1444