
HNC: ValidatingWebhook should have a way to avoid operating on specific namespaces and/or object kinds #1023

Closed
tcnksm opened this issue Aug 18, 2020 · 20 comments

@tcnksm
Contributor

tcnksm commented Aug 18, 2020

We've been using HNC in our experimental GKE environment, but because the HNC validating webhook objects.hnc.x-k8s.io was often non-responsive, various system operations such as leader elections were failing, and this made the cluster itself unstable...

I think the non-responsiveness itself is an issue (since this is an experimental environment we don't have enough monitoring, or any clue yet about why the webhook was non-responsive; we'd like to file that issue once we find the root cause). But at the same time, as the official documentation says, we should avoid running validating webhooks on kube-system-related operations (or other critical components). That is what caused the problems we faced.

To skip the validating webhook we can use a namespaceSelector, but that requires a label. Since the ValidatingWebhookConfiguration manifest is free for us to modify, we could introduce our own label for this purpose, but I think HNC itself could provide the label. For example, how about introducing admission.hnc.x-k8s.io/ignore, like the Gatekeeper project does? What do you think?
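For example, something like this (a sketch only; the webhook entry is abbreviated, the configuration name may differ per install, and admission.hnc.x-k8s.io/ignore is the label I'm proposing, not something HNC provides today):

```yaml
# Sketch: add a namespaceSelector to HNC's object webhook so any namespace
# carrying the proposed admission.hnc.x-k8s.io/ignore label is skipped.
apiVersion: admissionregistration.k8s.io/v1beta1
kind: ValidatingWebhookConfiguration
metadata:
  name: hnc-validating-webhook-configuration   # name may differ in your install
webhooks:
  - name: objects.hnc.x-k8s.io
    namespaceSelector:
      matchExpressions:
        - key: admission.hnc.x-k8s.io/ignore   # proposed label
          operator: DoesNotExist
    # ...rules, clientConfig, etc. unchanged...
```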

@adrianludwin
Contributor

adrianludwin commented Aug 18, 2020 via email

@tcnksm
Contributor Author

tcnksm commented Aug 18, 2020

Can you describe the circumstances when the webhook became unresponsive? Was HNC under heavy load at the time, or was it due to some kind of outage?

It was running on an experimental cluster (version 1.15.12-gke.9) which does not run many workloads (around 30 namespaces). Since it's an experimental cluster, it receives no production traffic and we rarely touch it (only when we try something new). I deployed HNC exactly as the user guide describes, with no manifest modifications.

Actually, we were notified of the cluster instability by GCP support (as I described above, we don't have any monitoring on this cluster). To recover, I just uninstalled HNC, so I don't have much information about what happened... I searched the logs a bit but could not find any clue about the non-responsive state.

I will try installing it again later and see whether this happens again. Then I can give you more information.

@adrianludwin
Contributor

Thanks! One more question - are you using the OSS version of HNC? If so, you can modify the validating webhook configuration to have a very short timeout (eg 2s) so that if it goes down, objects will still be allowed to go through quickly. We should probably have done that already, but I didn't realize how long the default timeout was (it's 30s since we're using the v1beta1 webhook API; this was shortened to 10s for v1).

If you're using the ConfigSync/ACM version, you won't be able to modify the webhook config directly, but since this is your test cluster, perhaps you could install the OSS version and see if the shorter timeout helps. If so, we'll get that into the next ConfigSync version.
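Concretely, it's just the timeoutSeconds field on the objects.hnc.x-k8s.io entry. A rough sketch (the configuration name and the webhook's index depend on your install, so check them with kubectl get validatingwebhookconfiguration first):

```yaml
# Sketch: shorten the object webhook's timeout so a non-responsive HNC
# doesn't hold up unrelated API requests for the full default timeout.
# One way to apply it, assuming this is the first webhook in the list:
#   kubectl patch validatingwebhookconfiguration hnc-validating-webhook-configuration \
#     --type=json -p='[{"op":"replace","path":"/webhooks/0/timeoutSeconds","value":2}]'
webhooks:
  - name: objects.hnc.x-k8s.io
    timeoutSeconds: 2   # default is 30s on v1beta1, 10s on v1
```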

@adrianludwin
Contributor

It's worth pointing out that the object webhooks are mainly there for safety, not correctness. If they don't run at all, then the object reconciler will still try to undo any changes to the cluster as quickly as it can. This won't defend against an attack, but it will be enough to stop a mistake. So having a short timeout doesn't substantially increase the risk to your cluster.

@tcnksm
Contributor Author

tcnksm commented Aug 19, 2020

One more question - are you using the OSS version of HNC?

Yes, we are using the OSS version of HNC.

If so, you can modify the validating webhook configuration to have a very short timeout (eg 2s) so that if it goes down, objects will still be allowed to go through quickly.

Yeah, we'll configure that next time.


It's worth pointing out that the object webhooks are mainly there for safety, not correctness.

OK, but I still don't understand why HNC needs to check all objects applied to the cluster. My understanding is that HNC handles its own CRDs, Namespaces, and the resources used for inheritance (like RBAC or NetworkPolicy).

@adrianludwin
Contributor

OK, but I still don't understand why HNC needs to check all objects applied to the cluster. My understanding is that HNC handles its own CRDs, Namespaces, and the resources used for inheritance (like RBAC or NetworkPolicy).

That's a good question. The short answer is that we could change HNC so that it only has a webhook for the types of objects it's tracking. The longer answer, and the reason we haven't done this yet, is because the list of object types can change dynamically and it was far easier to write a single static webhook configuration than it was to automatically generate all possible configurations at runtime. Our belief (which turned out to be incorrect) was that the webhook could quickly ignore all object types we didn't care about.

As a short-term fix, I think we'll change the timeout to 2s in a patch release (e.g. 0.5.2). Then there are two medium-term directions we can take:

  1. Exclude namespaces like kube-system from the webhook (rough sketch after this list). In HNCConfig, we'd maintain a list of namespaces that should always be excluded, and we'd label those namespaces with hnc.x-k8s.io/excludeNamespace. No human user would be allowed to set or remove that label, and HNC's own webhooks would not operate inside those namespaces.
  2. Exclude object kinds from the webhook. This is somewhat more challenging since it involves writing new logic to reconcile webhooks dynamically rather than applying a single, static configuration. However, it will reduce the load on HNC and be even safer.
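For option 1, roughly the following (the excludedNamespaces field, the namespace names, and the hnc.x-k8s.io/excludeNamespace label are proposals, not anything HNC supports today):

```yaml
# Hypothetical sketch of option 1: the HNC configuration object grows a
# list of namespaces that HNC must never operate on. Field names and
# values are illustrative only.
apiVersion: hnc.x-k8s.io/v1alpha1
kind: HNCConfiguration
metadata:
  name: config
spec:
  excludedNamespaces:   # proposed field
    - kube-system
    - kube-public
    - hnc-system
```

HNC would label each of those namespaces with hnc.x-k8s.io/excludeNamespace and select against that label in its webhook configurations, using the same namespaceSelector mechanism sketched earlier in this thread.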

Ideally we'd have both, and we can probably get both done in the next few months. Do you have a preference for which we should do first?

@adrianludwin adrianludwin changed the title from "HNC: ValidatingWebhook should have a way to avoid operating on specific namespaces" to "HNC: ValidatingWebhook should have a way to avoid operating on specific namespaces and/or object kinds" Aug 19, 2020
@adrianludwin adrianludwin self-assigned this Aug 19, 2020
@tcnksm
Contributor Author

tcnksm commented Aug 20, 2020

Ideally we'd have both, and we can probably get both done in the next few months. Do you have a preference for which you'd prefer first?

From a priority point of view, I think 1 is more important than 2, because it's more important to avoid blocking kube-system's critical operations. At the same time, I think it's also important for HNC itself. For example, if we don't exclude the HNC namespace and HNC isn't deployed correctly, a deadlock can happen (the webhook requires validation of the HNC deployment itself, but HNC isn't working; we faced this issue with a different controller before).


Exclude namespaces like kube-system from the webhook. In HNCConfig, we'd maintain a list of namespaces that should always be excluded, and we'd label those namespaces with hnc.x-k8s.io/excludeNamespace. No human user would be allowed to set or remove that label, and HNC's own webhooks would not operate inside those namespaces.

Does this mean HNC itself will put the label on the namespace? I'm not sure whether HNC should have that responsibility, but it would be helpful.

Exclude object kinds from the webhook.

I don't think we need a dynamic webhook configuration. What I had in mind is that, instead of writing "*", we could list only the supported resources, like NetworkPolicy or ConfigMap.
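Something like this instead of "*" (just to illustrate the idea; the resources listed are examples, not necessarily what HNC actually propagates):

```yaml
# Illustration: enumerate resources in the webhook rules rather than "*".
webhooks:
  - name: objects.hnc.x-k8s.io
    rules:
      - apiGroups: ["rbac.authorization.k8s.io"]
        apiVersions: ["*"]
        operations: ["CREATE", "UPDATE", "DELETE"]
        resources: ["roles", "rolebindings"]
      - apiGroups: ["networking.k8s.io"]
        apiVersions: ["*"]
        operations: ["CREATE", "UPDATE", "DELETE"]
        resources: ["networkpolicies"]
```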

@adrianludwin
Contributor

From a priority point of view, I think 1 is more important than 2, because it's more important to avoid blocking kube-system's critical operations. At the same time, I think it's also important for HNC itself.

Agreed on all counts, thanks.

Does this mean HNC itself will put the label on the namespace? I'm not sure whether HNC should have that responsibility, but it would be helpful.

I'd prefer a K8s-defined label like kube-critical but I don't think that's on the table :) Gatekeeper has a very similar solution where they label namespaces with something like "ignored-by-gatekeeper."

I don't think we need a dynamic webhook configuration. What I had in mind is that, instead of writing "*", we could list only the supported resources, like NetworkPolicy or ConfigMap.

The problem is that users can add arbitrary types. NetworkPolicy and ConfigMap actually are not included by default - only RoleBinding and Role. But users can add any types they want, including CRDs that we (the HNC authors) don't know about.

@tcnksm
Contributor Author

tcnksm commented Aug 21, 2020

But users can add any types they want, including CRDs that we (the HNC authors) don't know about.

Ah, OK, now I understand. I thought the set of supported resources was fixed. Thank you for the clarification.

@tcnksm
Contributor Author

tcnksm commented Aug 21, 2020

As a short-term fix, I think we'll change the timeout to 2s in a patch release

If that's fine with you, I would like to contribute this change.

@adrianludwin
Contributor

adrianludwin commented Aug 21, 2020 via email

@tcnksm
Contributor Author

tcnksm commented Aug 21, 2020

Sent it 👉 #1039

@adrianludwin
Contributor

I've been struggling to reproduce this. Simply killing the deployment or the webhook service isn't enough - in those cases, the webhook fails immediately and everything works just fine. I had to add a time.Sleep() to the webhook implementation to get this to reproduce. I suspect that the webhook only blocks during HNC startup on clusters with lots of namespaces, which explains why I've never run into this before.

@adrianludwin
Contributor

Looks like this will (eventually) be fixed in core K8s:

kubernetes/enhancements#2162

For now I'm going to put this on the backlog; I don't think there's an urgent issue given the changes we've made so far.

@adrianludwin
Contributor

I think we can fix this along with #374 as explained in that bug.

/assign @GinnyJI

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale label Mar 14, 2021
@yiqigao217
Contributor

/remove-lifecycle stale
@adrianludwin move this to backlog together with #374?

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale label Mar 16, 2021
@adrianludwin
Contributor

Since this was filed by @tcnksm I think it would be nice to have a go at this.

/good-first-issue

@k8s-ci-robot
Contributor

@adrianludwin:
This request has been marked as suitable for new contributors.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-good-first-issue command.

In response to this:

Since this was filed by @tcnksm I think it would be nice to have a go at this.

/good-first-issue

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the good first issue and help wanted labels Mar 16, 2021
@yiqigao217 yiqigao217 assigned yiqigao217 and unassigned GinnyJI Apr 1, 2021
@yiqigao217
Contributor

Fixed by #1444
