feat: Adding checksum annotations to the clustershield deployment #1945

Open · wants to merge 2 commits into main
Conversation


@yoderme yoderme commented Sep 13, 2024

Imagine the following scenario: the clustershield helm chart is re-deployed and the contents of either its configmap, secrets, or webhook change... and the clustershield deployment itself does not change. The clustershield deployment won't restart because it does not know that anything has changed - even though the things it depended on change.

This problem is especially pronounced for clustershield because with the default behavior it will auto-generate the certificates used for communication between the webhook and the deployment. If the webhook certificate changes and the deployment doesn't restart, the webhook won't be able to communicate with the pods in the deployment.

The canonical way to address this is to put checksums of the depended-upon templates into the deployment as annotations. Now when resources the deployment depends upon change (like the auto-generated certificates), the clustershield pods will restart.

This is the same solution implemented in the nodeanalyzer daemonset.
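
As a sketch of the pattern (the template paths and annotation keys here are illustrative, following the common Helm convention; the chart's actual file layout may differ), the deployment's pod template gains annotations like:

```yaml
# Sketch only: template paths are assumptions, not this chart's actual layout.
spec:
  template:
    metadata:
      annotations:
        checksum/config: {{ include (print $.Template.BasePath "/clustershield/configmap.yaml") . | sha256sum }}
        checksum/secrets: {{ include (print $.Template.BasePath "/clustershield/secrets.yaml") . | sha256sum }}
        checksum/webhook: {{ include (print $.Template.BasePath "/clustershield/webhook.yaml") . | sha256sum }}
```

Because the rendered bytes of those templates feed the hash, any change (including a regenerated certificate) alters the annotation, which changes the pod template and makes Kubernetes roll the deployment.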

What this PR does / why we need it:

Checklist

  • Title of the PR starts with type and scope (e.g. feat(agent,node-analyzer,sysdig-deploy):)
  • Chart Version bumped for the respective charts
  • Variables are documented in the README.md (or README.tpl in some charts)
  • Check GithubAction checks (like lint) to avoid merge-check stoppers
  • All test files are added in the tests folder of their respective chart and have a "_test" suffix

@yoderme
Author

yoderme commented Sep 13, 2024

#1944

Contributor

Hi @yoderme. Thanks for your PR.

After inspecting your changes, someone with write access to this repo needs
to approve and run the workflow.

@mavimo
Contributor

mavimo commented Sep 16, 2024

@yoderme

Imagine the following scenario: the clustershield helm chart is re-deployed and the contents of either its configmap, secrets, or webhook change... and the clustershield deployment itself does not change. The clustershield deployment won't restart because it does not know that anything has changed - even though the things it depended on change.

The Cluster Shield should notice changes in the configmap and secrets and pick up the new configuration. It may take some time (Kubernetes propagates updates to the pod with some delay), but a hash should not be required for that.

Do you have a situation where this is not happening?

@yoderme
Author

yoderme commented Sep 16, 2024

Imagine the following scenario: the clustershield helm chart is re-deployed and the contents of either its configmap, secrets, or webhook change... and the clustershield deployment itself does not change. The clustershield deployment won't restart because it does not know that anything has changed - even though the things it depended on change.

The Cluster Shield should notice changes in the configmap and secrets and pick up the new configuration. It may take some time (Kubernetes propagates updates to the pod with some delay), but a hash should not be required for that.

Do you have a situation where this is not happening?

Thanks for the reply @mavimo !

I read this and said to myself "I've never seen this not happen" and scratched my head for a while. I am not sure whether the clustershield pod will notice a configmap change (I didn't test it), but it definitely won't notice when the TLS certificates change.

So I went off to prove this to myself. I did a helm install

helm upgrade --install sysdig sysdig/sysdig-deploy -n sysdig --reuse-values --version 1.65.2

and then ran the same command again. Nothing restarted... but I also noticed that the certificates did not change. The sysdig-clustershield-tls-certs secret didn't change, and the validating webhook's caBundle didn't change. This puzzled me until I realized that the helm function cluster-shield.tlsGenCerts does a lookup() of the certificate and uses it if it exists. This is clever and avoids the problem.

HOWEVER, if one were to delete sysdig-clustershield-tls-certs and then run the helm upgrade above again, the TLS certs would change, the validating webhook would change, and the pods would not restart and would not pick up the change. This leads to many messages like the following in the kube-apiserver logs:

failed calling webhook "audit.secure.sysdig.com": failed to call webhook: Post "https://sysdig-clustershield.sysdig.svc:6443/k8s-audit?timeout=5s": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "sysdig-clustershield")

The way to get the pods to notice this change is the annotations with the hash.
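
The mechanism is easy to demonstrate outside Kubernetes (a minimal local sketch, not the chart's actual template): when the bytes of a depended-upon resource change, its sha256 hash changes, and a changed pod-template annotation is exactly what triggers a rollout.

```shell
# Simulate a certificate being regenerated and show that its checksum,
# which would live in a pod-template annotation, changes with it.
printf 'certificate-v1' > tls.crt
old=$(sha256sum tls.crt | cut -d' ' -f1)
printf 'certificate-v2' > tls.crt
new=$(sha256sum tls.crt | cut -d' ' -f1)
if [ "$old" != "$new" ]; then
  echo "annotation changed: rollout triggered"
fi
rm -f tls.crt
```

The deployment spec itself never has to change; the differing annotation value is enough for Kubernetes to consider the pod template new and roll the pods.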

This is actually a problem for me. I'm using kustomize basically like so:

helmCharts:
  - name: sysdig-deploy
    repo: https://charts.sysdig.com
    version: 1.62.1
    releaseName: sysdig
    namespace: sysdig
    valuesFile: values.yaml

On the back end, what happens here is that kustomize calls helm template to create a bunch of yaml and then applies that yaml. Because it runs helm template and there is no cluster info given, the TLS cert will change every time, and ... you see where this is headed.

Co-authored-by: Marco Vito Moscaritolo <mavimo@gmail.com>
@mavimo
Contributor

mavimo commented Sep 17, 2024

@yoderme thanks for the clarification.

I understand your use case; I'll check, but secrets and configmap hashes shouldn't be required (the Cluster Shield runs a watch process on the files, so restarting the pod isn't really required).

As for the webhook certificate, I'll check what the implications are and get back to you.
PS: If you want, you can use an external secret for the cert so that helm does not create a new cert on every release.

@yoderme
Author

yoderme commented Sep 18, 2024

I understand your use case; I'll check, but secrets and configmap hashes shouldn't be required (the Cluster Shield runs a watch process on the files, so restarting the pod isn't really required).

Admittedly I have not checked what happens when these change, so OK. It's the webhook cert that's been the problem for me.

As for the webhook certificate, I'll check what the implications are and get back to you. PS: If you want, you can use an external secret for the cert so that helm does not create a new cert on every release.

Understood about using an external secret. Plausible, but a bit of a pain. Thanks @mavimo for looking at this. If it's just the webhook cert that's a problem, I could change the PR to restart the clustershield pods only on changes to the validating webhook. Otherwise you'll have to monitor/restart internally in the clustershield pod.

@mavimo
Contributor

mavimo commented Sep 18, 2024

@yoderme we are doing some internal checks, as forcing cert changes may cause problems in the case of re-deployments.
I want to assure you that we are looking into this, but it may take some extra time, sorry for that! We will come back to this PR once we have checked some edge cases; based on our research we may ask you to make some changes to the PR. In the meantime, thanks for your patience!

@yoderme
Author

yoderme commented Sep 19, 2024

@mavimo understood and thanks for working on this!

@yoderme
Author

yoderme commented Sep 20, 2024

I was thinking about this, and realized that the really cool way to deal with this entire problem is to not set up the webhook certificate or the validating webhook in helm at all. Instead, when the cluster shield process starts up, have it create a self-signed certificate and then talk to the Kubernetes API server to create the validating webhook using that cert. Then on process shutdown, remove the validating webhook. The cluster shield service account will need the RBAC permissions to do this, but that's not hard. It's a big change from where you are now, though. Food for thought.

@mavimo
Contributor

mavimo commented Sep 24, 2024

@yoderme thanks for investing your time in thinking about this. I see some drawbacks to what you are suggesting, or at least to making it the default approach:

  • users who rely on external certificate generation (e.g. because they have cert-manager or other tools) and link it into the chart would no longer be able to have the certificate generated outside with their own CA
  • we need some extra logic to ensure that, in the case of autoscaling or multiple replicas, this is only done by one instance
  • a scale-down operation will terminate one instance, which should not remove the cert/webhook while other instances are still running
  • in case of failure we would not be able to do proper cleanup, leaving "unmanaged" resources in the cluster
  • for people doing GitOps, resources will exist in the cluster that are not defined in code, causing drift

Of course all these issues can be solved, but we are also investigating hot-reload for the certificate, as we already do for secrets and the configmap; we are not sure yet whether this can be done without drawbacks...

Some people on the team are joining the investigation; we will post an update here soon.

@yoderme
Author

yoderme commented Sep 27, 2024

I see some drawbacks

It's not perfect, but it does have advantages. To your point about "users who rely on external certificate generation" - you don't have to tell anyone this certificate exists. ;-) It's only for communication between the API server and the pod, so it's effectively hidden.

For an example of an open source project that takes this approach, see Kyverno. https://github.com/kyverno/kyverno

@AlbertoBarba
Contributor

Hi @yoderme, I ran some tests on your PR and everything seems to be fine.
The cluster-shield application is able to recognise when the configmap or the secrets have changed, so there should be no need for the checksum/configmap and checksum/secrets annotations.
I'll only keep checksum/webhook to allow the reload of the certificates.

@mavimo
Contributor

mavimo commented Oct 21, 2024

@yoderme I have a branch where I rebased and made the changes to support the webhook; if you can grant me access to the PR I'll push my update to your branch

@yoderme
Author

yoderme commented Oct 25, 2024

@yoderme I have a branch where I rebased and made the changes to support the webhook; if you can grant me access to the PR I'll push my update to your branch

I added you to my repo. You may just have to take my changes and submit them yourself; that's fine with me, no worries.

@mavimo
Contributor

mavimo commented Oct 25, 2024

You may just have to take my changes and submit them yourself; that's fine with me, no worries

@yoderme will do (I'll try to keep commit attribution to you)
