feat: Adding checksum annotations to the clustershield deployment #1945

Open · wants to merge 2 commits into main
Conversation


@yoderme yoderme commented Sep 13, 2024

Imagine the following scenario: the clustershield helm chart is re-deployed and the contents of either its configmap, secrets, or webhook change... and the clustershield deployment itself does not change. The clustershield deployment won't restart because it does not know that anything has changed - even though the things it depended on change.

This problem is especially pronounced for clustershield because with the default behavior it will auto-generate the certificates used for communication between the webhook and the deployment. If the webhook certificate changes and the deployment doesn't restart, the webhook won't be able to communicate with the pods in the deployment.

The canonical way to address this is to put checksums of the depended-upon templates into the deployment as annotations. Now when resources the deployment depends upon change (like the auto-generated certificates), the clustershield pods will restart.

This is the same solution implemented in the nodeanalyzer daemonset.
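
As a sketch of the pattern (the template paths and annotation keys here are illustrative, following the common Helm convention; the chart's actual file layout may differ), the deployment's pod template gains annotations like:

```yaml
# Sketch only: template paths are assumptions, not this chart's actual layout.
spec:
  template:
    metadata:
      annotations:
        checksum/config: {{ include (print $.Template.BasePath "/clustershield/configmap.yaml") . | sha256sum }}
        checksum/secrets: {{ include (print $.Template.BasePath "/clustershield/secrets.yaml") . | sha256sum }}
        checksum/webhook: {{ include (print $.Template.BasePath "/clustershield/webhook.yaml") . | sha256sum }}
```

Because the rendered bytes of those templates feed the hash, any change (including a regenerated certificate) alters the annotation, which changes the pod template and makes Kubernetes roll the deployment.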

What this PR does / why we need it:

Checklist

  • Title of the PR starts with type and scope (e.g. feat(agent,node-analyzer,sysdig-deploy):)
  • Chart Version bumped for the respective charts
  • Variables are documented in the README.md (or README.tpl in some charts)
  • Check GithubAction checks (like lint) to avoid merge-check stoppers
  • All test files are added in the tests folder of their respective chart and have a "_test" suffix

@yoderme
Author

yoderme commented Sep 13, 2024

#1944

Contributor

Hi @yoderme. Thanks for your PR.

After inspecting your changes, someone with write access to this repo needs
to approve and run the workflow.

@mavimo
Contributor

mavimo commented Sep 16, 2024

@yoderme

Imagine the following scenario: the clustershield helm chart is re-deployed and the contents of either its configmap, secrets, or webhook change... and the clustershield deployment itself does not change. The clustershield deployment won't restart because it does not know that anything has changed - even though the things it depended on change.

The Cluster Shield should notice changes in the configmap and secrets and pick up the new configuration. It may take some time (Kubernetes propagates updates to the pod with some delay), but a hash should not be required for that.

Do you have a situation where this is not happening?

@yoderme
Author

yoderme commented Sep 16, 2024

Imagine the following scenario: the clustershield helm chart is re-deployed and the contents of either its configmap, secrets, or webhook change... and the clustershield deployment itself does not change. The clustershield deployment won't restart because it does not know that anything has changed - even though the things it depended on change.

The Cluster Shield should notice changes in the configmap and secrets and pick up the new configuration. It may take some time (Kubernetes propagates updates to the pod with some delay), but a hash should not be required for that.

Do you have a situation where this is not happening?

Thanks for the reply @mavimo !

I read this and said to myself "I've never seen this not happen" and scratched my head for a while. I am not sure whether the clustershield pod will notice a configmap change (I didn't test it), but it definitely won't notice when the TLS certificates change.

So I went off to prove this to myself. I did a helm install

helm upgrade --install sysdig sysdig/sysdig-deploy -n sysdig --reuse-values --version 1.65.2

and then ran the same command again. Nothing restarted... but I also noticed that the certificates did not change. The sysdig-clustershield-tls-certs secret didn't change, and the validating webhook's caBundle didn't change. This puzzled me until I realized that the helm function cluster-shield.tlsGenCerts does a lookup() of the certificate and uses it if it exists. This is clever and avoids the problem.

HOWEVER, if one were to delete sysdig-clustershield-tls-certs and then run the helm upgrade above again, the TLS certs would change, the validating webhook would change, and the pods would not restart and would not pick up the change. This leads to many messages like the following in the kube-apiserver logs:

failed calling webhook "audit.secure.sysdig.com": failed to call webhook: Post "https://sysdig-clustershield.sysdig.svc:6443/k8s-audit?timeout=5s": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "sysdig-clustershield")

The way to get the pods to notice this change is the annotations with the hash.
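
The mechanism is easy to demonstrate outside Kubernetes (a minimal local sketch, not the chart's actual template): when the bytes of a depended-upon resource change, its sha256 hash changes, and a changed pod-template annotation is exactly what triggers a rollout.

```shell
# Simulate a certificate being regenerated and show that its checksum,
# which would live in a pod-template annotation, changes with it.
printf 'certificate-v1' > tls.crt
old=$(sha256sum tls.crt | cut -d' ' -f1)
printf 'certificate-v2' > tls.crt
new=$(sha256sum tls.crt | cut -d' ' -f1)
if [ "$old" != "$new" ]; then
  echo "annotation changed: rollout triggered"
fi
rm -f tls.crt
```

The deployment spec itself never has to change; the differing annotation value is enough for Kubernetes to consider the pod template new and roll the pods.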

This is actually a problem for me. I'm using kustomize basically like so:

helmCharts:
  - name: sysdig-deploy
    repo: https://charts.sysdig.com
    version: 1.62.1
    releaseName: sysdig
    namespace: sysdig
    valuesFile: values.yaml

On the back end, what happens here is that kustomize calls helm template to create a bunch of yaml and then applies that yaml. Because it runs helm template and there is no cluster info given, the TLS cert will change every time, and ... you see where this is headed.

Co-authored-by: Marco Vito Moscaritolo <mavimo@gmail.com>
@mavimo
Contributor

mavimo commented Sep 17, 2024

@yoderme thanks for the clarification.

I understand your use case; I'll check, but secrets and configmap hashes shouldn't be required (the Cluster Shield runs a watch process on the files, so restarting the pod isn't really required).

As for the webhook certificate, I'll check what the implications are and get back to you.
PS: If you want, you can use an external secret for the cert so that helm does not create a new cert on every release.

@yoderme
Author

yoderme commented Sep 18, 2024

I understand your use case; I'll check, but secrets and configmap hashes shouldn't be required (the Cluster Shield runs a watch process on the files, so restarting the pod isn't really required).

Admittedly I have not checked what happens when these change, so OK. It's the webhook cert that's been the problem for me.

As for the webhook certificate, I'll check what the implications are and get back to you. PS: If you want, you can use an external secret for the cert so that helm does not create a new cert on every release.

Understood about using an external secret. Plausible, but a bit of a pain. Thanks @mavimo for looking at this. If it's just the webhook cert that's a problem, I could change the PR to restart the clustershield pods only on changes to the validating webhook. Otherwise you'll have to monitor/restart internally in the clustershield pod.

@mavimo
Contributor

mavimo commented Sep 18, 2024

@yoderme we are doing some internal checks, as forcing cert changes may cause problems in the case of re-deployments.
I want to assure you that we are looking into this, but it may take some extra time, sorry for that! We will come back to this PR once we have checked some edge cases; based on our research we may ask you to make some changes to the PR. In the meantime, thanks for your patience!

@yoderme
Author

yoderme commented Sep 19, 2024

@mavimo understood and thanks for working on this!

@yoderme
Author

yoderme commented Sep 20, 2024

I was thinking about this, and realized that the really cool way to deal with this entire problem is to not set up the webhook certificate or the validating webhook in helm at all. Instead, when the cluster shield process starts up, have it create a self-signed certificate and then talk to the Kubernetes API server to create the validating webhook using that cert. Then on process shutdown, remove the validating webhook. The cluster shield service account will need the RBAC permissions to do this, but that's not hard. It's a big change from where you are now, though. Food for thought.

@mavimo
Contributor

mavimo commented Sep 24, 2024

@yoderme thanks for investing your time in thinking about this. I see some drawbacks to what you are suggesting, or at least to making it the default approach:

  • users who rely on external certificate generation (e.g. because they have cert-manager or other tools) and link it into the chart would no longer be able to have the certificate generated outside with their own CA
  • we need some extra logic to ensure that, in the case of autoscaling or multiple replicas, this is only done by one instance
  • a scale-down operation will terminate one instance, which should not remove the cert/webhook while other instances are still running
  • in case of failure we would not be able to do proper cleanup, leaving "unmanaged" resources in the cluster
  • for people doing GitOps, resources will exist in the cluster that are not defined in code, causing drift

Of course all these issues can be solved, but we are also investigating hot-reload for the certificate, as we already do for secrets and the configmap; we are not sure yet whether this can be done without drawbacks...

Some people on the team are joining the investigation; we will post an update here soon.

@yoderme
Author

yoderme commented Sep 27, 2024

I see some drawbacks

It's not perfect, but it does have advantages. To your point about "users who rely on external certificate generation" - you don't have to tell anyone this certificate exists. ;-) It's only for communication between the API server and the pod, so it's effectively hidden.

For an example of an open source project that takes this approach, see Kyverno. https://github.com/kyverno/kyverno

@AlbertoBarba
Contributor

Hi @yoderme, I ran some tests on your PR and everything seems to be fine.
The cluster-shield application is able to recognise when the configmap or the secrets have changed, so there should be no need for the checksum/configmap and checksum/secrets annotations.
I'll only keep checksum/webhook to allow the reload of the certificates.

@mavimo
Contributor

mavimo commented Oct 21, 2024

@yoderme I have a branch where I rebased and made the changes to support the webhook; if you can grant me access to the PR I'll push my update to your branch

@yoderme
Author

yoderme commented Oct 25, 2024

@yoderme I have a branch where I rebased and made the changes to support the webhook; if you can grant me access to the PR I'll push my update to your branch

I added you to my repo. You may just have to take my changes and submit them yourself; that's fine with me, no worries.

@mavimo
Contributor

mavimo commented Oct 25, 2024

You may just have to take my changes and submit them yourself; that's fine with me, no worries

@yoderme will do (I'll try to keep commit attribution to you)
