
HNC: ValidatingWebhook should have a way to avoid operating on specific namespaces and/or object kinds #1023

Closed
tcnksm opened this issue Aug 18, 2020 · 20 comments

@tcnksm
Contributor

tcnksm commented Aug 18, 2020

We've been using HNC in our experimental GKE environment, but because the HNC validating webhook objects.hnc.x-k8s.io was often non-responsive, various system operations such as leader elections were failing, and this made the cluster itself unstable...

I think the non-responsiveness itself is an issue (since this is an experimental environment we don't have enough monitoring, or any clue yet about why the webhook was non-responsive; we'd like to file that issue once we find the root cause). But at the same time, as the official documentation says, we should avoid running validating webhooks on kube-system-related operations (or other critical components). That is what caused the problems we faced.

To skip the validating webhook we can use a namespaceSelector, but that requires a label. Since the ValidatingWebhookConfiguration manifest is free for us to modify, we could introduce our own label for this purpose, but I think HNC itself could provide the label. For example, how about introducing admission.hnc.x-k8s.io/ignore, like the Gatekeeper project does? What do you think?
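For example, something like this (a sketch only; the webhook entry is abbreviated, the configuration name may differ per install, and admission.hnc.x-k8s.io/ignore is the label I'm proposing, not something HNC provides today):

```yaml
# Sketch: add a namespaceSelector to HNC's object webhook so any namespace
# carrying the proposed admission.hnc.x-k8s.io/ignore label is skipped.
apiVersion: admissionregistration.k8s.io/v1beta1
kind: ValidatingWebhookConfiguration
metadata:
  name: hnc-validating-webhook-configuration   # name may differ in your install
webhooks:
  - name: objects.hnc.x-k8s.io
    namespaceSelector:
      matchExpressions:
        - key: admission.hnc.x-k8s.io/ignore   # proposed label
          operator: DoesNotExist
    # ...rules, clientConfig, etc. unchanged...
```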

@adrianludwin
Contributor

adrianludwin commented Aug 18, 2020 via email

@tcnksm
Contributor Author

tcnksm commented Aug 18, 2020

Can you describe the circumstances when the webhook became unresponsive? Was HNC under heavy load at the time, or was it due to some kind of outage?

It was running on an experimental cluster (version 1.15.12-gke.9) which does not run many workloads (around 30 namespaces). Since it's an experimental cluster, it receives no production traffic and we rarely touch it (only when we try something new). I deployed HNC exactly as the user guide describes, with no manifest modifications.

Actually, we were notified of the cluster instability by GCP support (as I described above, we don't have any monitoring on this cluster). To recover, I just uninstalled HNC, so I don't have much information about what happened... I searched the logs a bit but could not find any clue about the non-responsive state.

I will try installing it again later and see whether this happens again. Then I can give you more information.

@adrianludwin
Contributor

Thanks! One more question - are you using the OSS version of HNC? If so, you can modify the validating webhook configuration to have a very short timeout (eg 2s) so that if it goes down, objects will still be allowed to go through quickly. We should probably have done that already, but I didn't realize how long the default timeout was (it's 30s since we're using the v1beta1 webhook API; this was shortened to 10s for v1).

If you're using the ConfigSync/ACM version, you won't be able to modify the webhook config directly, but since this is your test cluster, perhaps you could install the OSS version and see if the shorter timeout helps. If so, we'll get that into the next ConfigSync version.
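Concretely, it's just the timeoutSeconds field on the objects.hnc.x-k8s.io entry. A rough sketch (the configuration name and the webhook's index depend on your install, so check them with kubectl get validatingwebhookconfiguration first):

```yaml
# Sketch: shorten the object webhook's timeout so a non-responsive HNC
# doesn't hold up unrelated API requests for the full default timeout.
# One way to apply it, assuming this is the first webhook in the list:
#   kubectl patch validatingwebhookconfiguration hnc-validating-webhook-configuration \
#     --type=json -p='[{"op":"replace","path":"/webhooks/0/timeoutSeconds","value":2}]'
webhooks:
  - name: objects.hnc.x-k8s.io
    timeoutSeconds: 2   # default is 30s on v1beta1, 10s on v1
```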

@adrianludwin
Contributor

It's worth pointing out that the object webhooks are mainly there for safety, not correctness. If they don't run at all, then the object reconciler will still try to undo any changes to the cluster as quickly as it can. This won't defend against an attack, but it will be enough to stop a mistake. So having a short timeout doesn't substantially increase the risk to your cluster.

@tcnksm
Contributor Author

tcnksm commented Aug 19, 2020

One more question - are you using the OSS version of HNC?

Yes, we are using the OSS version of HNC.

If so, you can modify the validating webhook configuration to have a very short timeout (eg 2s) so that if it goes down, objects will still be allowed to go through quickly.

Yeah, we'll configure that next time.


It's worth pointing out that the object webhooks are mainly there for safety, not correctness.

OK, but I still don't understand why HNC needs to check all objects applied to the cluster. My understanding is that HNC handles its own CRDs, Namespaces, and the resources used for inheritance (like RBAC or NetworkPolicy).

@adrianludwin
Contributor

OK, but I still don't understand why HNC needs to check all objects applied to the cluster. My understanding is that HNC handles its own CRDs, Namespaces, and the resources used for inheritance (like RBAC or NetworkPolicy).

That's a good question. The short answer is that we could change HNC so that it only has a webhook for the types of objects it's tracking. The longer answer, and the reason we haven't done this yet, is because the list of object types can change dynamically and it was far easier to write a single static webhook configuration than it was to automatically generate all possible configurations at runtime. Our belief (which turned out to be incorrect) was that the webhook could quickly ignore all object types we didn't care about.

As a short-term fix, I think we'll change the timeout to 2s in a patch release (e.g. 0.5.2). Then there are two medium-term directions we can take:

  1. Exclude namespaces like kube-system from the webhook (rough sketch after this list). In HNCConfig, we'd maintain a list of namespaces that should always be excluded, and we'd label those namespaces with hnc.x-k8s.io/excludeNamespace. No human user would be allowed to set or remove that label, and HNC's own webhooks would not operate inside those namespaces.
  2. Exclude object kinds from the webhook. This is somewhat more challenging since it involves writing new logic to reconcile webhooks dynamically rather than applying a single, static configuration. However, it will reduce the load on HNC and be even safer.
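For option 1, roughly the following (the excludedNamespaces field, the namespace names, and the hnc.x-k8s.io/excludeNamespace label are proposals, not anything HNC supports today):

```yaml
# Hypothetical sketch of option 1: the HNC configuration object grows a
# list of namespaces that HNC must never operate on. Field names and
# values are illustrative only.
apiVersion: hnc.x-k8s.io/v1alpha1
kind: HNCConfiguration
metadata:
  name: config
spec:
  excludedNamespaces:   # proposed field
    - kube-system
    - kube-public
    - hnc-system
```

HNC would label each of those namespaces with hnc.x-k8s.io/excludeNamespace and select against that label in its webhook configurations, using the same namespaceSelector mechanism sketched earlier in this thread.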

Ideally we'd have both, and we can probably get both done in the next few months. Do you have a preference for which we should do first?

@adrianludwin adrianludwin changed the title from "HNC: ValidatingWebhook should have a way to avoid operating on specific namespaces" to "HNC: ValidatingWebhook should have a way to avoid operating on specific namespaces and/or object kinds" Aug 19, 2020
@adrianludwin adrianludwin self-assigned this Aug 19, 2020
@tcnksm
Contributor Author

tcnksm commented Aug 20, 2020

Ideally we'd have both, and we can probably get both done in the next few months. Do you have a preference for which you'd prefer first?

From a priority point of view, I think 1 is more important than 2, because it's more important to avoid blocking kube-system's critical operations. At the same time, I think it's also important for HNC itself. For example, if we don't exclude the HNC namespace and HNC isn't deployed correctly, a deadlock can happen (the webhook requires validation of the HNC deployment itself, but HNC isn't working; we faced this issue with a different controller before).


Exclude namespaces like kube-system from the webhook. In HNCConfig, we'd maintain a list of namespaces that should always be excluded, and we'd label those namespaces with hnc.x-k8s.io/excludeNamespace. No human user would be allowed to set or remove that label, and HNC's own webhooks would not operate inside those namespaces.

Does this mean HNC itself will put the label on the namespace? I'm not sure whether HNC should have that responsibility, but it would be helpful.

Exclude object kinds from the webhook.

I don't think we need a dynamic webhook configuration. What I had in mind is that, instead of writing "*", we could list only the supported resources, like NetworkPolicy or ConfigMap.
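Something like this instead of "*" (just to illustrate the idea; the resources listed are examples, not necessarily what HNC actually propagates):

```yaml
# Illustration: enumerate resources in the webhook rules rather than "*".
webhooks:
  - name: objects.hnc.x-k8s.io
    rules:
      - apiGroups: ["rbac.authorization.k8s.io"]
        apiVersions: ["*"]
        operations: ["CREATE", "UPDATE", "DELETE"]
        resources: ["roles", "rolebindings"]
      - apiGroups: ["networking.k8s.io"]
        apiVersions: ["*"]
        operations: ["CREATE", "UPDATE", "DELETE"]
        resources: ["networkpolicies"]
```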

@adrianludwin
Contributor

From a priority point of view, I think 1 is more important than 2, because it's more important to avoid blocking kube-system's critical operations. At the same time, I think it's also important for HNC itself.

Agreed on all counts, thanks.

Does this mean HNC itself will put the label on the namespace? I'm not sure whether HNC should have that responsibility, but it would be helpful.

I'd prefer a K8s-defined label like kube-critical but I don't think that's on the table :) Gatekeeper has a very similar solution where they label namespaces with something like "ignored-by-gatekeeper."

I don't think we need a dynamic webhook configuration. What I had in mind is that, instead of writing "*", we could list only the supported resources, like NetworkPolicy or ConfigMap.

The problem is that users can add arbitrary types. NetworkPolicy and ConfigMap actually are not included by default - only RoleBinding and Role. But users can add any types they want, including CRDs that we (the HNC authors) don't know about.

@tcnksm
Contributor Author

tcnksm commented Aug 21, 2020

But users can add any types they want, including CRDs that we (the HNC authors) don't know about.

Ah, OK, now I understand. I thought the set of supported resources was fixed. Thank you for the clarification.

@tcnksm
Contributor Author

tcnksm commented Aug 21, 2020

As a short-term fix, I think we'll change the timeout to 2s in a patch release

If that's fine with you, I would like to contribute this change.

@adrianludwin
Contributor

adrianludwin commented Aug 21, 2020 via email

@tcnksm
Contributor Author

tcnksm commented Aug 21, 2020

Sent it 👉 #1039

@adrianludwin
Contributor

I've been struggling to reproduce this. Simply killing the deployment or the webhook service isn't enough - in those cases, the webhook fails immediately and everything works just fine. I had to add a time.Sleep() to the webhook implementation to get this to reproduce. I suspect that the webhook only blocks during HNC startup on clusters with lots of namespaces, which explains why I've never run into this before.

@adrianludwin
Contributor

Looks like this will (eventually) be fixed in core K8s:

kubernetes/enhancements#2162

For now I'm going to put this on the backlog; I don't think there's an urgent issue given the changes we've made so far.

@adrianludwin
Contributor

I think we can fix this along with #374 as explained in that bug.

/assign @GinnyJI

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale label Mar 14, 2021
@yiqigao217
Contributor

/remove-lifecycle stale
@adrianludwin move this to backlog together with #374?

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale label Mar 16, 2021
@adrianludwin
Contributor

Since this was filed by @tcnksm I think it would be nice to have a go at this.

/good-first-issue

@k8s-ci-robot
Contributor

@adrianludwin:
This request has been marked as suitable for new contributors.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-good-first-issue command.

In response to this:

Since this was filed by @tcnksm I think it would be nice to have a go at this.

/good-first-issue

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the good first issue and help wanted labels Mar 16, 2021
@yiqigao217 yiqigao217 assigned yiqigao217 and unassigned GinnyJI Apr 1, 2021
@yiqigao217
Contributor

Fixed by #1444
