Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Enhancement to outline path for network policies for all core components #1720

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

knobunc
Copy link
Contributor

@knobunc knobunc commented Nov 26, 2024

This enhancement outlines how we will add network policies to all core components in OpenShift. It also outlines how it will eventually become enforcing, and how we will test compliance.

@openshift-ci openshift-ci bot requested review from danwinship and trozet November 26, 2024 17:38
@knobunc knobunc changed the title Enhancement to outline path for network policies for all core components WIP Enhancement to outline path for network policies for all core components Nov 26, 2024
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 26, 2024
Comment on lines 47 to 49
- As an administrator, I need to be able to override specific
OpenShift policies to be more restrictive so that I can satisfy my
security department
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the most part, this should not be necessary. If a given restriction is safe to apply, then it should be part of the default policy. And in that case, anything "more restrictive" would be something that would cause cluster functionality to break in some way. IMO we shouldn't allow admins to break functionality they aren't using, we should allow them to disable that functionality instead (and then they don't need to add NetworkPolicies restricting its use).

One specific case I can think of where this might apply is admission control webhooks / validation webhooks. The apiserver needs to reach out to any namespace that contains such a webhook, so by default, we'd have to allow the apiserver egress access to all namespaces (or at least, all "system" namespaces?). Administrators might want to disable that. But should that be expressed as "kube-apiserver can reach all namespaces by default but administrators can create additional NetworkPolicies to restrict that", or should we instead have kube-apiserver-operator just automatically add NPs for all and only the namespaces that contain webhooks? Or should we have a rule like "webhooks can only exist in namespaces with the label foo"?

(Also, FTR, there is no way to implement "kube-apiserver can reach all namespaces by default but administrators can create additional NetworkPolicies to restrict that" given the semantics of NetworkPolicy; we would have to say that administrators are allowed to delete the "kube-apiserver can reach all namespaces" NetworkPolicy.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want administrators to create AdminNetworkPolicy to express their exceptions (where they can deny).

Ultimately, I would love for operators to have a way to have config that allows them to create policy to restrict things like webhooks. They may need an allow-list of valid domains, or something.

But, for the shorter term, we already have people deploying their own Network Policy (not admin, it wasn't ready then) to restrict our stuff since we have no policy today.

1. Change the namespace admission controller so all namespaces with an
openshift- prefix (since those are special anyway) are labeled with
the “openshift-namespace’ label so **that network policy can address
them**
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I always forget what the rule about OLM operators and openshift- is, so maybe you should explain that somewhere. (In particular, it seems from some of the other comments that OLM operators can make openshift- namespaces, but then how does that work with "(Not) Adding policies for operators not created by Red Hat"?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rule is horrible.

They get installed into a default openshift-operators namespace, but the OLM will need to own the policy for that namespace. I suspect it will need to be wildcard allow egress and ingress.

But, worse, the admin can chose any namespace to add it to... there are ongoing conversations about how to deal with this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Former OLM team member here (recently moved to control plane group) 👋

OLM installed operators probably won't be able to specify NetworkPolicy resources in their bundles according to https://olm.operatorframework.io/docs/tasks/creating-operator-manifests/#packaging-additional-objects-alongside-an-operator

OLMv1 is supposed to GA in 4.18 and will, initially, respect a subset of the existing bundle format. The desire is to eventually support bundling formats that allows operator authors more control as to the contents packaged within their bundle (OLMv1 is going to be unopinionated in what is in the bundles) meaning operator authors will eventually be able to specify NetworkPolicy resources in their bundles.

With the current bundling format it seems like OLM would need to be updated to either:

  1. Allow NetworkPolicy resources in the bundle alongside the CSV. IIUC, to achieve compliance with the high-level goal of this proposal this would likely mean updating all bundles in the catalogs to contain a sufficient NetworkPolicy resource. Some open questions for this approach:
    • Do we have enough control of the catalog curation pipelines to ensure this is done properly?
    • Should a wildcard NetworkPolicy be added to all bundles by default if the authors of the operator can't/won't add one themselves?
  2. Stamp out a permissive NetworkPolicy resource on each operator installation.

With the focus on OLMv1 and some general attrition, there are not many folks on the OLM team left with knowledge of the legacy "OLMv0". Both of these are likely not trivial asks of the OLM team.

I do wonder if having CVO manage a default NetworkPolicy for the openshift-operators namespace, that is permissive enough for all operators installed in that namespace, is sufficient enough to achieve the compliance we are looking for even if admins can install an operator in any namespace?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👀

Operators running in the openshift-operators namespace may need network access to just about anything, so I agree with @knobunc that OLM will likely end up needing to ship with an allow-all policy.

Beyond that, I am pretty leery of any solution that requires operators to change their bundle contents, as this will effectively block the use of older versions of $everything.

I'm not clear on whether there's a mechanism that allows the cluster admin to install/override/configure the NetworkPolicy in an impacted namespace. I see a knob to shut the whole thing down. And this section hints at it, but I can't tell what actor (cluster admin or developer) or environment (dev/test or production) it's talking about. Anyway, it seems like a per-namespace "escape hatch" would help mitigate this issue, as it would allow operators to publish documentation describing how to retrofit policies for legacy versions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cluster admin can always create Admin Network Policies to override the Network Policies. They could also add a Network Policy to a namespace, but I recommend against that because the NPs are purely additive, there is no way to deny something added.

Not sure what you mean by "Beyond that, I am pretty leery of any solution that requires operators to change their bundle contents, as this will effectively block the use of older versions of $everything."

Can you clarify your concern?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Scenario: Customer is currently making use of some number of operators that were built/published without embedded NetworkPolicyz, either because they are versions that were built prior to this effort, or because the authors simply didn't. Customer upgrades their cluster to a version that enforces the requirement for NetworkPolicyz for everything running in openshift-* namespaces. Operators stop working because they don't have 'em. Customer is sad.

Note that I hedged my concern with speculations on how the admin could override this behavior. If there's a relatively easy way for them to do that, it might be okay. But of course there's the risk that they'll never go back and fix it up even if/when that becomes possible. (My oldest and wisest son likes to say, "There's nothing more permanent than a temporary solution.")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a section about it. In general, I see this as a decision that the OLM team should own, and they can work out how they want to handle migration to a default deny world for all of the operators.

However, they are somewhat of a unique case, and I have had a few conversations with @joelanford about how to handle it, so I feel comfortable adding a section to this enhancement.


We also need to provide this information to the support organization
in a must-gather. We will augment the must-gather tool to include
information about network policy blocks in OpenShift namespaces.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that sufficiently-broken policies could interfere with both the Network Observability tool and must-gather.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hopefully we can catch at least some of that in automated testing. I am assuming that these are the RH shipped policies, not random ones that a customer applied to us.

…nents

This enhancement outlines how we will add network policies to all core
components in OpenShift.  It also outlines how it will eventually
become enforcing, and how we will test compliance.
First round of updates in response to review comments, and some other
clean-up.
@knobunc knobunc force-pushed the core-network-policies branch from 0244f94 to 25b3070 Compare December 3, 2024 18:57
@openshift-bot
Copy link

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 1, 2025

## Proposal

1. Change the namespace admission controller so all namespaces with an
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A new, openshift/kubernetes admission plugin is probably a simpler implementation. This will not work on HCP.

Use ValidatingAdmissionPolicy to prevent the removal of the label on openshift-* namespaces.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume you mean that we would add a new default mutating admission plugin (and not as a webhook?)

And I took your comment to mean that my proposal would not work with HCP and that is why we should use the admission plugin.

Comment on lines +110 to +112
3. Work out how we can identify connections blocked by policy in
OpenShift namespaces (either egress or ingress) and work out how to
include it in a must-gather
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Critical feature for development and field debugging. Very glad to see it here.

Comment on lines +168 to +169
4. If it does not start with `openshift-` then any
`security.openshift.io/openshift-namespace` label will be stripped out and can not be set
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't mutate the input, but use ValidatingAdmissionPolicy to cause creation to fail to prevent this in newer clusters. Having the cluster-network-operator detect a violation of this rule and prevent upgrade until the cluster-admin corrects the problem is a good idea.

3. Otherwise it will strip the `security.openshift.io/openshift-namespace` label

TODO: Decide if this is the correct behavior... do we want to allow
namespaces to opt-in to being part of the platform. I do not think
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our platform namespaces are openshift-* namespaces. That's how things like monitoring recognize them. To be a part of the platform requires using that prefix.

1. Document what traffic flows (ingress and egress) need to be
allowed for pods in the namespaces

2. The ACS product can be used to analyze workloads in a cluster to
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't deploy ACS in CI. When ACS isn't installed, what options are available?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is purely for the developers who do not know what their network flows are (which appears to be the case sometimes, even though I find that surprising). It would not be something end-users would need to do. And most of the time I don't think developers would need it.

trying to talk to whom and mint tight network policies to reflect
that model.

4. The Network Observability tool can be used to detect when a network policy blocked traffic
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Network observability isn't available by default. Without network observability, how will we identify and resolve issues?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We would need to enable https://docs.openshift.com/container-platform/4.18/networking/network_security/logging-network-security.html#nw-networkpolicy-audit-configure_logging-network-security by default with a low rate limit. We'd also need to make sure that gets in to a must-gather. I'll update the doc.

Comment on lines +280 to +282
Are there any unique considerations for making this change work with Hypershift?

No.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm surprised there is no impact to having the kube-apiserver run outside the cluster and reach into the cluster to call things like admission webhooks. How about a specific admission webhook and CRD conversion webhook test to confirm that there is no HCP impact?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I had hoped we would not need a webhook. I thought from our earlier conversation about this that you were not opposed to changing the namespace admission controller. But if we need to have a webhook, then I agree with your concern. Will update.

Comment on lines +299 to +302
How does this proposal affect MicroShift? For example, if the proposal
adds configuration options through API resources, should any of those
behaviors also be exposed to MicroShift admins through the
configuration file for MicroShift?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Microshift doesn't run the cluster-network-operator. How will upgrades of microshift get the appropriate namespace labels? How will upgrades prevent non-openshift-* namespaces from squatting on the label?

Copy link
Contributor

openshift-ci bot commented Jan 7, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from knobunc. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot
Copy link

Stale enhancement proposals rot after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Rotten proposals close after an additional 7d of inactivity.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 15, 2025
@openshift-bot
Copy link

Rotten enhancement proposals close after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Reopen the proposal by commenting /reopen.
Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Exclude this proposal from closing again by commenting /lifecycle frozen.

/close

@openshift-ci openshift-ci bot closed this Jan 22, 2025
Copy link
Contributor

openshift-ci bot commented Jan 22, 2025

@openshift-bot: Closed this PR.

In response to this:

Rotten enhancement proposals close after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Reopen the proposal by commenting /reopen.
Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Exclude this proposal from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@knobunc
Copy link
Contributor Author

knobunc commented Mar 4, 2025

/reopen
/remove-lifecycle rotten

@openshift-ci openshift-ci bot reopened this Mar 4, 2025
Copy link
Contributor

openshift-ci bot commented Mar 4, 2025

@knobunc: Reopened this PR.

In response to this:

/reopen
/remove-lifecycle rotten

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci openshift-ci bot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Mar 4, 2025
Copy link
Contributor

openshift-ci bot commented Mar 4, 2025

@knobunc: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/markdownlint 25b3070 link true /test markdownlint

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@knobunc knobunc changed the title WIP Enhancement to outline path for network policies for all core components Enhancement to outline path for network policies for all core components Mar 12, 2025
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 12, 2025
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants