feat: Adding checksum annotations to the clustershield deployment #1945
base: main
Conversation
Imagine the following scenario: the clustershield helm chart is re-deployed and the contents of its configmap, secrets, or webhook change... but the clustershield deployment itself does not change. The clustershield deployment won't restart, because it has no way of knowing that anything changed, even though the things it depends on did.

This problem is especially pronounced for clustershield because, with the default behavior, it auto-generates the certificates used for communication between the webhook and the deployment. If the webhook certificate changes and the deployment doesn't restart, the webhook won't be able to communicate with the pods in the deployment.

The canonical way to address this is to put checksums of the depended-upon templates into the deployment as annotations. Then, when resources the deployment depends on change (like the auto-generated certificates), the clustershield pods restart.

This is the same solution implemented in the nodeanalyzer daemonset.
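For illustration, the usual Helm pattern looks something like this (the template paths below are placeholders, not necessarily this chart's actual file names):

```yaml
# Sketch of the checksum-annotation pattern on the deployment's pod template.
# Template paths are illustrative placeholders.
spec:
  template:
    metadata:
      annotations:
        # Rendering the dependency templates and hashing them means the pod
        # template changes (and the pods roll) whenever the dependencies change.
        checksum/config: {{ include (print $.Template.BasePath "/clustershield/configmap.yaml") . | sha256sum }}
        checksum/secret: {{ include (print $.Template.BasePath "/clustershield/secret.yaml") . | sha256sum }}
        checksum/webhook: {{ include (print $.Template.BasePath "/clustershield/webhook.yaml") . | sha256sum }}
```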
Hi @yoderme. Thanks for your PR. After inspecting your changes someone with write access to this repo needs to approve and run the workflow.
The Cluster Shield should notice changes in the configmap and secrets and pick up the configuration changes. It may take some time (Kubernetes propagates updates into the pod with some delay), but a hash should not be required for that. Do you have a situation where this is not happening?
Thanks for the reply @mavimo! I read this and said to myself "I've never seen this not happen" and scratched my head for a while. I am not sure whether the clustershield pod will notice a configmap change (I didn't test it), but it definitely won't notice when the TLS certificates change. So I went off to prove this to myself. I did a helm install and then ran the same command again. Nothing restarted... but I also noticed that the certificates did not change either. HOWEVER, if one were to delete the generated certificate secret and install again, new certificates would be created and the pods still would not restart.
The way to get the pods to notice this change is the annotations with the hash. This is actually a problem for me. I'm using kustomize, basically like so:
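A sketch along these lines; the chart name, repo URL, and release values here are placeholders:

```yaml
# kustomization.yaml (sketch) -- rendered with: kustomize build --enable-helm
# Chart/repo/release values are illustrative assumptions.
helmCharts:
  - name: sysdig-deploy
    repo: https://charts.sysdig.com
    releaseName: sysdig
    namespace: sysdig
    valuesFile: values.yaml
```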
On the back end, what happens here is that kustomize calls helm template. Since helm template renders without access to the cluster, it cannot reuse the existing certificate secret, so fresh certificates are generated on every render.
Co-authored-by: Marco Vito Moscaritolo <mavimo@gmail.com>
@yoderme thanks for the clarification. I understand your use case. I'll check, but secrets and configmap hashes shouldn't be required (the Cluster Shield is running a watch process on the files, so restarting the pod isn't really required). As for the webhook certificate, I'll check what the implications are and get back to you.
Admittedly I have not checked what happens when these change, so OK. It's the webhook cert that's been the problem for me.
Understood about using an external secret. Plausible, but also a bit of a pain. Thanks @mavimo for looking at this. If it's just the webhook cert that's a problem, I could change the PR to have the clustershield pods restart only on changes to the validating webhook. Otherwise you'll have to monitor/restart internally in the clustershield pod.
@yoderme we are doing some internal checks, as forcing cert changes may have implications that cause problems in case of re-deployments.
@mavimo understood, and thanks for working on this!
I was thinking about this, and realized that a really clean way to deal with this entire problem is to not set up the webhook certificate or the validatingwebhook in helm at all. Instead, when the cluster shield process starts up, have it create a self-signed certificate and then talk to the k8s API server to create the validating webhook using that cert. Then, on process shutdown, remove the validating webhook. The cluster shield service account would need the RBAC permissions to do this, but that's not hard. It's a big change from where you are now, though. Food for thought.
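A rough sketch of the RBAC this would take; the names are illustrative, not an actual manifest from this chart:

```yaml
# ClusterRole sketch: lets the cluster shield service account create and
# clean up its own ValidatingWebhookConfiguration at startup/shutdown.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: clustershield-webhook-manager
rules:
  - apiGroups: ["admissionregistration.k8s.io"]
    resources: ["validatingwebhookconfigurations"]
    verbs: ["create", "get", "update", "delete"]
```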
@yoderme thanks for investing your time thinking about this. I see some drawbacks to what you are suggesting, or at least to making it the default approach:
OFC all these issues can be solved, but we are also investigating hot-reload for the certificate, as we do for secrets and configmaps. We are not sure it can be done in a way that does not introduce drawbacks... Some people on the team are joining the investigation; we will post an update here soon.
It's not perfect, but it does have advantages. To your point about "users who rely on external certificate generation": you don't have to tell anyone this certificate exists. ;-) It's only for communication between the API server and the pod, so it's effectively hidden. For an example of an open source project that takes this approach, see Kyverno: https://github.com/kyverno/kyverno
Hi @yoderme, I performed some tests on your PR and everything seems to be fine.
@yoderme I have a branch where I rebased and made the changes to support the webhook; if you can grant me access to the PR, I'll push my update onto your branch.
I added you to my repo. You may just have to take my changes and submit them yourself; that's fine with me, no worries. |
@yoderme will do (I'll try to keep the commit attribution to you).