Additionally, the stub Ignition config [referenced](https://github.com/openshift/installer/blob/1ca0848f0f8b2ca9758493afa26bf43ebcd70410/pkg/asset/machines/gcp/machines.go#L197) in the `MachineSet` is also not managed. This stub is used by the Ignition binary on first boot to authenticate with and consume content from the `machine-config-server` (MCS). The served content includes the actual Ignition configuration and the target OCI format RHCOS image. The Ignition binary performs first boot provisioning based on this, then hands off to the `machine-config-daemon` (MCD) first boot service, which reboots the node into the target OCI format RHCOS image.
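For illustration, a stub of this kind is a small pointer-style Ignition config that merges in the full config from the MCS. The hostname and CA placeholder below are illustrative, not taken from the installer:

```json
{
  "ignition": {
    "version": "3.2.0",
    "config": {
      "merge": [
        {"source": "https://api-int.<cluster-domain>:22623/config/worker"}
      ]
    },
    "security": {
      "tls": {
        "certificateAuthorities": [
          {"source": "data:text/plain;charset=utf-8;base64,<MCS-CA-bundle>"}
        ]
      }
    }
  }
}
```

The embedded CA bundle is what allows the node to trust the MCS endpoint, which is why cert rotation matters for this stub.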
**Note**: As of 4.19, the MCO supports [management of this TLS cert](https://issues.redhat.com/browse/MCO-1208). With this work in place, the MCO can now attempt to upgrade the stub Ignition config, instead of hardcoding to the `*-managed` stub as mentioned previously. This will help preserve any user customizations that were present in the stub Ignition config.
This is also considered a blocking issue for [SigStore GA](https://issues.redhat.com/browse/OCPNODE-2619). It has caused issues such as [OCPBUGS-38809](https://issues.redhat.com/browse/OCPBUGS-38809), because the older podman binary in the boot image could not parse `sigstoreSigned` fields in `/etc/containers/policy.json`. Similar hard-to-anticipate issues can occur in the future.
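For context, a `sigstoreSigned` policy entry looks roughly like the fragment below (the registry scope and key path are illustrative); an older podman rejects the whole policy because it does not recognize this requirement type:

```json
{
  "default": [{"type": "reject"}],
  "transports": {
    "docker": {
      "quay.io/openshift-release-dev": [
        {
          "type": "sigstoreSigned",
          "keyPath": "/etc/containers/release-key.pub",
          "signedIdentity": {"type": "matchRepoDigestOrExact"}
        }
      ]
    }
  }
}
```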
This is also a soft pre-requisite for both dual-stream RHEL support in OpenShift, and on-cluster layered builds. RPM-OSTree presently does a deploy-from-self to get a new-enough rpm-ostree to deploy image-based RHEL CoreOS systems, and we would like to avoid doing this for bootc if possible. We would also like to prevent RHEL8->RHEL10 direct updates once that is available for OpenShift.
### User Stories
#### Error & Alert Mechanism
MSBIC sync failures can occur for several reasons:
- The MSBIC notices an OwnerReference and is able to determine that updating the `MachineSet` will likely cause thrashing. This is considered a misconfiguration and in such cases, the user is expected to exclude this `MachineSet` from boot image management.
- The `coreos-bootimages` ConfigMap is unavailable or in an incorrect format. This will likely happen if a user manually edits the ConfigMap, overriding the CVO.
- The `coreos-bootimages` ConfigMap takes too long to be stamped by the MCO. This indicates that there are larger problems in the cluster such as an upgrade failure/timeout or an unrelated cluster failure.
- Patching the `MachineSet` fails. This indicates a temporary API server blip, or larger RBAC issues.
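The decision flow above can be sketched as follows. This is a hypothetical simplification, not the actual controller code; the function name and return values are illustrative:

```python
# Hypothetical sketch of the MSBIC per-MachineSet sync decision flow
# described above. Names and outcome strings are illustrative only.

def sync_boot_image(machineset, coreos_bootimages, patch_succeeded):
    """Classify the outcome of one MSBIC sync attempt.

    machineset: dict with optional "ownerReferences".
    coreos_bootimages: dict of ConfigMap data, or None if unavailable.
    patch_succeeded: callable applying the providerSpec patch, returns bool.
    """
    # OwnerReference present: another controller owns this MachineSet and
    # would fight the MSBIC over providerSpec, so it is skipped and the
    # user is expected to exclude it from boot image management.
    if machineset.get("ownerReferences"):
        return "skipped-owner-reference"

    # ConfigMap missing or malformed (e.g. manually edited, overriding
    # the CVO): cannot determine the correct boot image.
    if not coreos_bootimages or "stream" not in coreos_bootimages:
        return "failed-bad-configmap"

    # Patch failure: likely a temporary API server blip or an RBAC issue;
    # the controller would retry with backoff.
    if not patch_succeeded(machineset):
        return "retrying-patch-failed"

    return "synced"


# Example: a MachineSet owned by another controller is skipped.
print(sync_boot_image({"ownerReferences": [{}]}, {"stream": "{}"}, lambda m: True))
# → skipped-owner-reference
```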
- Standalone OpenShift: Yes, this is the main target form factor.
- MicroShift: No, as it does [not](https://github.com/openshift/microshift/blob/main/docs/contributor/enabled_apis.md) use `MachineSets`.
- Hypershift: No, Hypershift does not have this issue.
- Hive: Hive manages `MachineSets` via `MachinePools`. The MachinePool controller generates the `MachineSets` manifests (by invoking vendored installer code) which include the `providerSpec`. Once a `MachineSet` has been created on the spoke, the only things that will be reconciled on it are replicas, labels, and taints - [unless a backdoor is enabled](https://github.com/openshift/hive/blob/0d5507f91935701146f3615c990941f24bd42fe1/pkg/constants/constants.go#L518). If the `providerSpec` ever goes out of sync, a warning will be logged by the MachinePool controller but otherwise this discrepancy is ignored. In such cases, the MSBIC will not have any issue reconciling the `providerSpec` to the correct boot image. However, if the backdoor is enabled, both the MSBIC and the MachinePool Controller will attempt to reconcile the `providerSpec` field, causing churn. The Hive team has [updated the comment](https://github.com/openshift/hive/pull/2596/files) on the backdoor annotation to indicate that it is mutually exclusive with this feature.
##### Supported platforms
##### Projected timeline
This is a tentative timeline, subject to change (GA = General Availability (opt-in), TP = Tech Preview (opt-in), DEF = Default-on (opt-out)).
- For bookkeeping purposes, the MCO will annotate the `MachineConfiguration` object when opting in the cluster by default.
- This mechanism will be active on installs and upgrades.
- If the cluster admin wishes to opt out of the feature, they must do so by explicitly opting out the cluster via the API knob prior to the upgrade.
- Any `MachineSet` that has an OwnerReference will be skipped for boot image updates. This will raise an alert/warning for the cluster admin, but it will no longer cause a degrade.
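For illustration, the opt-out knob could be exercised as below. This assumes the `managedBootImages` shape of the existing `MachineConfiguration` API; the `None` selection mode name is an assumption and may differ in the final API:

```yaml
# Illustrative explicit opt-out via the MachineConfiguration API knob.
# The "None" mode name is assumed; opt-in uses "All" (or "Partial" with
# a label selector).
apiVersion: operator.openshift.io/v1
kind: MachineConfiguration
metadata:
  name: cluster
spec:
  managedBootImages:
    machineManagers:
      - resource: machinesets
        apiGroup: machine.openshift.io
        selection:
          mode: None
```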
### Enforcement of bootimage skew
Some combination of the following mechanisms should be implemented to alert users, particularly those in non-MachineSet-backed scaled environments. The options generally fall under proactive enforcement (require users to either update or acknowledge risk before upgrading to a new version) vs. reactive enforcement (only fail when a non-compliant boot image is used to scale into the cluster).
#### Proactive
Add a new field in the `coreos-bootimages` ConfigMap in the MCO namespace that stores the cluster's current boot image and allows for easy comparison against the skew policy described in the release payload.
- For MachineSet-backed clusters, this would be updated by the MSBIC after it successfully updates boot images.
- For non-MachineSet-backed clusters, this would be updated by the cluster admin to indicate the last manually updated boot image. The cluster admin would need to update this ConfigMap every few releases, when the RHEL minor version on which the RHCOS container is built changes (e.g. 9.6 -> 9.8).
The cluster admin may also choose to opt out of skew management via this ConfigMap, indicating that they will not need to scale nodes and thereby opting out of both skew enforcement and scaling functionality.
A potential problem here is that the way boot images are stored in the `MachineSet` is lossy. On certain platforms, there is no way to recover the boot image metadata from the `MachineSet`. This is most likely to happen the first time the MCO attempts to do skew enforcement on a cluster that has never had boot image updates. In such cases, the MCO will default to the install-time boot image, which can be recovered from the [aleph version](https://github.com/coreos/coreos-assembler/pull/768) of the control plane nodes.
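The comparison the enforcement step needs can be sketched as a simple RHEL minor-version check. This is an illustrative simplification under the assumption that the skew policy is expressed as a minimum RHEL `major.minor`; the function names are not from the MCO:

```python
# Illustrative skew check: compare the RHEL minor underlying the cluster's
# recorded boot image against the minimum allowed by the incoming release's
# skew policy. Not the actual MCO implementation.

def rhel_version(version: str) -> tuple[int, int]:
    """Parse a 'major.minor' RHEL version string, e.g. '9.6' -> (9, 6)."""
    major, minor = version.split(".", 1)
    return int(major), int(minor)

def boot_image_skew_ok(recorded_rhel: str, minimum_rhel: str) -> bool:
    """True if the recorded boot image meets the policy's minimum RHEL level."""
    return rhel_version(recorded_rhel) >= rhel_version(minimum_rhel)


# Example: a 9.4-based boot image violates a policy requiring at least 9.6.
print(boot_image_skew_ok("9.4", "9.6"))
# → False
```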
This ConfigMap can then be monitored to enforce skew limits. If the skew is determined to be too large, the MCO can update its `ClusterOperator` object with an `Upgradeable=False` condition, along with remediation steps in the `Condition` message. This will signal to the CVO that the cluster is not suitable for an upgrade.
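The resulting condition would take roughly the shape below; the `reason` string and message wording are hypothetical:

```yaml
# Illustrative ClusterOperator status condition set by the MCO.
conditions:
  - type: Upgradeable
    status: "False"
    reason: BootImageSkewUnsupported   # assumed reason name
    message: >-
      Boot image skew exceeds the policy of the incoming release. Update
      boot images, or acknowledge the risk, before upgrading.
```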
As stated earlier, to remediate, the cluster admin would then have to do one of the following:
- Turn on boot image updates if it is a machineset backed cluster.