
Proposal for adding Self healing feature in the operator #145

Open · wants to merge 3 commits into base: main

Conversation

ShubhamRwt (Contributor):

This proposal shows how we plan to implement the self-healing feature in the operator.

Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>
@scholzj (Member) left a comment:

Thanks for the proposal.

I think it would be useful to use regular human language here rather than Cruise Control terminology, as that would give a better perspective on what exactly is and is not being proposed here (and what self-healing actually does and does not do).

I do not understand the role of the StrimziNotifier. While I doubt that anyone will find it appealing to have Kubernetes events used for the notification instead of Slack or Prometheus, I cannot of course rule out that this will find its users. But it seems to be only marginally related to Strimzi. I'm sure we can host this under the Strimzi GitHub organization if desired, but it seems more like an unrelated subproject that is separate from the self-healing feature itself.

The actual self-healing ... you focus a lot on the disk and broker failures. But it seems to me like the self-healing does not offer good solutions for these. The idea to solve this by rescheduling the partition replicas to other brokers seems well suited for Kafka on bare-metal with local storage, where a dead server often means that you have to get a brand new server to replace it. It does not seem well suited to self-healing Kubernetes infrastructure and cloud-based environments, where Pods can just be rescheduled and a new disk can be provisioned in a matter of seconds.

So it seems to me like the self-healing itself as done by Cruise Control could be useful for dealing with imbalanced brokers. But for situations such as broker or disk failures or insufficient capacity, we should be more ambitious and aim to solve them properly in the Strimzi Cluster Operator. That is where an actual StrimziNotifier might play a future role: not issuing Kubernetes Events to tell the user about a broken disk, but providing a mechanism to tell Strimzi about it so that Strimzi can react to it, e.g. by scaling for capacity based on a recommendation from CC or, as suggested, deleting the broken PVC and getting a new one provisioned to replace it. While I understand that this might be out of scope for this proposal, I think it is important to leave space for it, for example by focusing here on the auto-rebalancing features and leaving the other detected issues unused for future development.

## Current situation

During normal operations it's common for Kafka clusters to become unbalanced (through partition key skew, uneven partition placement etc.) or for problems to occur with disks or other underlying hardware which makes the cluster unhealthy.
Currently, if we encounter any such scenario we need to fix these issues manually, e.g. if there is a broker failure we might move the partition replicas from that corrupted broker to some other healthy broker by using the `KafkaRebalance` custom resource in the [`remove-broker`](https://strimzi.io/docs/operators/latest/full/deploying.html#proc-generating-optimization-proposals-str) mode.
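As an illustration, such a manual fix might look like the following `KafkaRebalance` resource (a sketch based on the Strimzi `KafkaRebalance` API; the resource name, cluster name, and broker ID are hypothetical):

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaRebalance
metadata:
  name: move-replicas-off-broker-3     # hypothetical name
  labels:
    strimzi.io/cluster: my-cluster     # hypothetical cluster name
spec:
  mode: remove-brokers                 # mode value as used by the KafkaRebalance API
  brokers: [3]                         # broker whose partition replicas should be moved away
```

After the proposal is approved, Cruise Control moves the replicas from broker 3 to the remaining healthy brokers.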
Member:

This is not some bare-metal cluster but Kubernetes. So I'm struggling to find the issues where you would need to do what you suggest. Instead, on Kubernetes, you would normally just fix the issue on the existing broker.

Member:

I can see your point. At the Kubernetes level, maybe a broker failure is more related to a lack of resources, which you should fix at the cluster level rather than by removing the broker. A better example could be a disk failure on a broker or some goal violation (which is the usual reason why we use manual rebalancing).


With smaller clusters, it is feasible to fix things manually. However, for larger ones it can be very time-consuming, or just not feasible, to fix all the issues on your own.
Member:

I think you should clarify what issues you are talking about here. From the past discussion, the terminology used here seems highly confusing ... and I suspect that words such as issues, anomalies, and problems are ambiguous at best.

Member:

For me they are all synonyms more or less. Maybe just use one and stick with it?


## Motivation

With the help of the self-healing feature we can resolve issues like disk failure and broker failure automatically.
Member:

Nothing in this proposal suggests that Cruise Control can solve disk or broker failures.

Member:

AFAICS, the proposal introduces the concept of broker/disk failure detectors and related fix class which highlight that CC can solve them. Or what are you looking for?


Cruise Control treats these issues as "anomalies", the detection of which is the responsibility of the anomaly detector mechanism.
Currently, users are [able to set](https://strimzi.io/docs/operators/latest/full/deploying.html#setting_up_alerts_for_anomaly_detection) the notifier to one of those included with Cruise Control (<list of included notifiers>).
Member:

The link seems broken? (Double ( and )?)

The anomaly detector manager helps in detecting the anomalies and also handling the detected anomalies.
It acts as a coordinator between the detector classes and the classes that handle and resolve the anomalies.
Various detector classes like `GoalViolationDetector`, `DiskFailureDetector`, `KafkaBrokerFailureDetector` etc. are used for anomaly detection; they run periodically to check whether the cluster has their corresponding anomalies (the period can be configured through the `anomaly.detection.interval.ms` configuration).
Every detector class works in a different way to detect their corresponding anomalies. A `KafkaBrokerFailureDetector` will require a deep usage of requesting metadata to the Kafka cluster (i.e. broker failure) while if we talk about `DiskFailureDetector` it will be more inclined towards using the admin client API. In the same way the `MetricAnomalyDetector` will be making use of metrics and other detector like `GoalViolationDetector` and `TopicAnomalyDetector` will have their own way to detect anomalies.
Member:

Not sure what exactly it is trying to say ... but it does not seem to make sense to me.

Contributor:

I'm not sure if we need this level of detail. Maybe just link to the docs of these detectors? If not, maybe reword the sentence to something like this:

Suggested change
Every detector class works in a different way to detect their corresponding anomalies. A `KafkaBrokerFailureDetector` will require a deep usage of requesting metadata to the Kafka cluster (i.e. broker failure) while if we talk about `DiskFailureDetector` it will be more inclined towards using the admin client API. In the same way the `MetricAnomalyDetector` will be making use of metrics and other detector like `GoalViolationDetector` and `TopicAnomalyDetector` will have their own way to detect anomalies.
Detector classes have different mechanisms to detect their corresponding anomalies. For example, `KafkaBrokerFailureDetector` utilises Kafka Metadata API whereas `DiskFailureDetector` utilises Kafka Admin API. Furthermore, `MetricAnomalyDetector` uses metrics while other detector such as `GoalViolationDetector` and `TopicAnomalyDetector` use a different mechanism to detect anomalies.

Maybe we should briefly mention how GoalViolationDetector and TopicAnomalyDetector detect anomalies, as we do for the others.

Member:

> Not sure what exactly it is trying to say ... but it does not seem to make sense to me.

What the text is trying to say is that each detector has its own way to detect the anomaly (i.e. using the admin client API, or a deep metadata API request, or reading metrics and so on). Maybe the changes proposed by @tinaselenge make it clearer?


The `StrimziNotifier` class is going to reside in a separate module in the operator repository called `strimzi-cruise-control-notifier`, so that a separate JAR for the notifier can be built for inclusion in the Cruise Control container image.
We will also need to include the fabric8 client dependencies in the notifier JAR that will be used for generating the Kubernetes events as well as for reading the annotation on the Kafka resource from the notifier.
Member:

Reading what annotation?

Member:

Yeah, I think you are anticipating what's going to be in the next paragraph but this sentence doesn't make any sense here.

The user should also have the option to pause the self-healing feature, in case they know or plan to do some rolling restart of the cluster or some rebalance they are going to perform which can cause issue with the self-healing itself.
For this, we will be having an annotation called `strimzi.io/self-healing-paused`.
We can set this annotation on the Kafka custom resource to pause the self-healing feature.
Self-healing is not paused by default, so a missing annotation means that detected anomalies will keep being healed automatically.
When self-healing is paused, all anomalies detected after the pause annotation is applied are ignored and no fix is attempted for them.
Anomalies that were already being fixed continue with the fix.
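For illustration, pausing could then look like this on the `Kafka` custom resource (a sketch; the cluster name is hypothetical, the annotation name is the one proposed above):

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster                           # hypothetical cluster name
  annotations:
    strimzi.io/self-healing-paused: "true"   # proposed pause annotation
spec:
  # ... rest of the Kafka spec unchanged ...
```

Removing the annotation (or not setting it at all) would leave self-healing active.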
Member:

Why is pausing the self-healing useful for the users? Assuming it is useful, why not just disable the self-healing?

Member:

Disabling self-healing would mean restarting CC. Pausing means that the notifier just returns "ignore" to the anomaly detector (but you could still see anomalies logged in the CC pod, for example).


#### What happens when self-healing is running/fixing an anomaly and brokers are rolled

If a broker fails, self-healing will trigger and Cruise Control will try to move the partition replicas from the failed broker to other healthy brokers in the Kafka cluster.
Member:

What exactly does it mean when the broker fails? How does CC recognize it? How does it distinguish failure for example from some temporary Kubernetes scheduling issue?

Member:

The broker anomaly detector uses a deep metadata request to the Kafka cluster to know the brokers' status. The broker failure in particular is not fixed straight away; the fix is first delayed in order to double-check "later" whether the broker is still offline. All the timings between double checks are configurable, so a temporary issue will not be reported as an anomaly or fixed.
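For reference, the double-check timings mentioned above correspond to Cruise Control configuration keys (a sketch based on the Cruise Control configuration documentation; the values shown are illustrative, not recommended defaults):

```yaml
# Illustrative Cruise Control settings (e.g. under spec.cruiseControl.config
# in the Kafka custom resource); values are examples only.
anomaly.detection.interval.ms: 300000              # how often the detectors run
broker.failure.alert.threshold.ms: 900000          # wait before alerting on a failed broker
broker.failure.self.healing.threshold.ms: 1800000  # wait before self-healing acts on it
```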

If, during the process of moving partitions from the failed broker, another broker in the Kafka cluster gets deleted or rolled, then Cruise Control would finish the transfer from the failed broker but log that `self-healing finished successful with error`.
Member:

How would it finish the transfer from the failed broker when the target broker is stopped? You mean that it will continue with it after the target broker is rolled?

Contributor (Author):

Yes. If the target broker is rolling at that moment, then self-healing would end saying `self-healing finished successful with error`. The periodic anomaly check will then detect the failed broker again and try to fix it. If the target broker has been rolled by that time, self-healing will start again; if the rolled broker is not back up, we will get two anomalies, one to fix the failed broker and one for the target broker.


#### What happens when self-healing is running/fixing an anomaly and Kafka Rebalance is applied

If self-healing is ongoing and a `KafkaRebalance` resource is posted in the middle of it then the `KafkaRebalance` resource will move to `NotReady` state, stating that there is already a rebalance ongoing.
Member:

Does this need to differentiate between asking for a proposal and approving a proposal? For example, when asking for a proposal I would expect the operator to wait and give me a proposal after the self-healing ends.

Member:

From the following snippet I can see the dryrun=false, so this seems to be a condition happening when you approve a proposal. @ShubhamRwt does it mean that during self-healing, CC was able to return a proposal to you, you were able to approve it, and you then got the error?


#### Anomaly Detector Manager

The anomaly detector manager helps in detecting the anomalies and also handling the detected anomalies.
Contributor:

Suggested change
The anomaly detector manager helps in detecting the anomalies and also handling the detected anomalies.
The anomaly detector manager helps in detecting the anomalies as well as handling them.

The detected anomalies can be of various types:
* Goal Violation - This happens if certain optimization goals are violated (e.g. `DiskUsageDistributionGoal`). These goals can be configured independently (through the `self.healing.goals` config) of those used for manual rebalancing.
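A sketch of what that configuration might look like (the goal list here is purely illustrative; the class names follow the Cruise Control goals package):

```yaml
# Illustrative: goals checked by self-healing, independent of manual rebalancing goals.
self.healing.goals: com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskUsageDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaCapacityGoal
```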
Contributor:

So you mean the goals in `KafkaRebalance` and the `self.healing.goals` are independent of each other?

Contributor (Author):

If you have specified hard goals in the KR, then the `self.healing.goals` should be a superset of the hard goals.

Member:

> if you have specified hard goals in the KR, then the self.healing.goals should be a superset of the hard goals.

If by KR you mean the KafkaRebalance custom resource, I don't think what you are saying is true. There is no relationship between a specific KafkaRebalance and self-healing configuration.
Maybe are you referring to the general CC configuration? So the default hard goals in CC has to be a superset of self.healing.goals (and not the other way around)?

The anomaly detector manager calls the notifier to get an action regarding whether the anomaly should be fixed, delayed, or ignored.
If the action is fix, then the anomaly detector manager calls the classes that are required to resolve the anomaly.

Anomaly detection also has several other [configurations](https://github.com/linkedin/cruise-control/wiki/Configurations#anomalydetector-configurations), such as the detection interval and the anomaly notifier class, which can effect the performance of the Cruise Control server and the latency of the anomaly detection.
Contributor:

Suggested change
Anomaly detection also has several other [configurations](https://github.com/linkedin/cruise-control/wiki/Configurations#anomalydetector-configurations), such as the detection interval and the anomaly notifier class, which can effect the performance of the Cruise Control server and the latency of the anomaly detection.
Anomaly detection also has various [configurations](https://github.com/linkedin/cruise-control/wiki/Configurations#anomalydetector-configurations), such as the detection interval and the anomaly notifier class, which can affect the performance of the Cruise Control server and the latency of the anomaly detection.

Cruise Control also provides [custom notifiers](https://github.com/linkedin/cruise-control/wiki/Configure-notifications) like the Slack Notifier, Alerta Notifier etc. for notifying users about the anomalies. There are multiple other [self-healing notifier](https://github.com/linkedin/cruise-control/wiki/Configurations#selfhealingnotifier-configurations) related configurations you can use to tune notifiers for your use case.
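For example, switching from the default no-op notifier to one of the custom notifiers is done via the `anomaly.notifier.class` configuration (a sketch; the class name follows the Cruise Control notifier documentation):

```yaml
# Illustrative: select a non-default notifier so detected anomalies are acted upon.
anomaly.notifier.class: com.linkedin.kafka.cruisecontrol.detector.notifier.SlackSelfHealingNotifier
```

The proposed `StrimziNotifier` would be plugged in through this same configuration key.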

#### Self Healing
If self-healing is enabled in Cruise Control and anomalies are detected, then a fix for these will be attempted automatically.
Contributor:

A fix will be attempted automatically only if `anomaly.notifier.class` is configured to a non-default notifier, right? So enabling self-healing alone is not enough, as with the default notifier (`NoopNotifier`), anomalies are detected but ignored?

Contributor (Author):

Yes, you will need to override the notifier too

The anomaly status would change based on how the healing progresses. The anomaly can transition to these possible statuses:
* DETECTED - The anomaly is just detected and no action is yet taken on it
* IGNORED - Self healing is either disabled or the anomaly is unfixable
* FIX_STARTED - The anomaly has started getting fixed
@tinaselenge (Contributor), Jan 14, 2025:

Suggested change
* FIX_STARTED - The anomaly has started getting fixed
* FIX_STARTED - The anomaly fix has started

* FIX_FAILED_TO_START - The proposal generation for the anomaly fix failed
* CHECK_WITH_DELAY - The anomaly fix is delayed
* LOAD_MONITOR_NOT_READY - Monitor which is monitoring the workload of a Kafka cluster is in loading or bootstrap state
Contributor:

Suggested change
* LOAD_MONITOR_NOT_READY - Monitor which is monitoring the workload of a Kafka cluster is in loading or bootstrap state
* LOAD_MONITOR_NOT_READY - Monitor for workload of a Kafka cluster is in loading or bootstrap state

```
self.healing.enabled: true
```

If the user wants to enable self-healing only for some specific anomalies, they can use configurations like `self.healing.broker.failure.enabled` for broker failure, `self.healing.goal.violation.enabled` for goal violation etc.
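Such a selective setup might look like this (a sketch; the per-anomaly keys follow the Cruise Control self-healing configuration naming, and the combination shown is only an example):

```yaml
# Illustrative: enable self-healing only for selected anomaly types.
self.healing.enabled: false                  # do not enable healing globally
self.healing.broker.failure.enabled: true    # heal broker failures
self.healing.goal.violation.enabled: true    # heal goal violations
```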
Contributor:

Are any anomalies enabled by default? If the user does not specify anomalies to detect, will self-healing do nothing?

@ShubhamRwt (Contributor, Author), Jan 15, 2025:

If the user doesn't specify any specific anomalies (i.e. you use just `self.healing.enabled: true`) then self-healing will detect all the anomalies. In case the user knows that some anomaly will keep appearing because they have a known issue in their cluster, they can skip the check for that particular anomaly.

Member:

> If the user doesn't specify any specific anomalies (i.e. you use just `self.healing.enabled: true`) then self-healing will detect all the anomalies.

I wonder if it would be worth having all anomaly detection disabled by default and forcing users to explicitly enable which specific anomalies they want detected and fixed automatically. I imagine the alerting for some anomalies might be pretty noisy and the automatic fixes may be quite taxing on the cluster; having all anomalies enabled at once by default might surprise and overload users not familiar with the feature.

The notifier class returns the `action` that is going to be taken on the notified anomaly.
These actions can be classified into three types:
* FIX - Start the anomaly fix
* CHECK - Delay the anomaly fix
Contributor:

Does the delay action apply a configured delay before taking the action provided by the notifier?

@ppatierno (Member) left a comment:

I left some comments. Can you go through the proposal and split paragraphs with one sentence per line? Thanks!

Member:

This is the first time you mention the term "notifier" but it shows up out of context. You should first explain what it is and why it's useful in general, not only in relation to self-healing but more to alerting.


### Pausing the self-healing feature

Member:
Suggested change
For this, we will be having an annotation called `strimzi.io/self-healing-paused`.
For this, we will be having an annotation called `strimzi.io/self-healing-paused` on the `Kafka` custom resource.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Disabling self-healing would mean restarting CC. Pausing means that the notifier just returns "ignore" to the anomaly detector (but you could still see anomalies logged in the CC pod, for example).

uid: 71928731-df71-4b9f-ac77-68961ac25a2f
```

We will have checks on methods like `onGoalViolation`, `onBrokerFailure`, `onTopicAnomaly` etc. in the `StrimziNotifier` to check if the `pause` annotation is applied on the Kafka custom resource or not.
Member

there is no pause annotation. Use the exact name.
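A hypothetical sketch of such a check (class and method names are illustrative stand-ins, not the actual `StrimziNotifier` implementation):

```java
import java.util.Map;

// Illustrative sketch of the annotation check the proposal describes:
// each on* method in the notifier would consult this before acting.
public class PauseCheckSketch {

    public static final String PAUSE_ANNOTATION = "strimzi.io/self-healing-paused";

    // Given the annotations currently on the Kafka custom resource,
    // decide whether a detected anomaly should be ignored.
    public static boolean shouldIgnore(Map<String, String> kafkaAnnotations) {
        return "true".equalsIgnoreCase(
                kafkaAnnotations.getOrDefault(PAUSE_ANNOTATION, "false"));
    }

    public static void main(String[] args) {
        // paused cluster: anomalies are ignored
        System.out.println(shouldIgnore(Map.of(PAUSE_ANNOTATION, "true")));
        // missing annotation: anomalies keep getting healed
        System.out.println(shouldIgnore(Map.of()));
    }
}
```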


#### What happens when self-healing is running/fixing an anomaly and brokers are rolled

If a broker fails, self-healing will trigger and Cruise Control will try to move the partition replicas from the failed broker to other healthy brokers in the Kafka cluster.
Member

The broker anomaly detector uses a deep metadata request to the Kafka cluster to know the brokers' status. A broker failure in particular is not fixed straight away but is first delayed in order to double-check "later" whether the broker is still offline. All the timings between double checks are configurable. So any temporary issue will not be reported as an anomaly and no fix will be attempted.


#### What happens when self-healing is running/fixing an anomaly and Kafka Rebalance is applied

If self-healing is ongoing and a `KafkaRebalance` resource is posted in the middle of it then the `KafkaRebalance` resource will move to `NotReady` state, stating that there is already a rebalance ongoing.
Member

From the following snippet I can see dryrun=false, so this seems to be a condition happening when you approve a proposal. @ShubhamRwt does it mean that during self-healing, CC was able to return a proposal to you, you approved it, and you then got the error?


The anomaly detector manager helps in detecting the anomalies and also handling the detected anomalies.
It acts as a coordinator between the detector classes as well as the classes which will be handling and resolving the anomalies.
Various detector classes like `GoalViolationDetector`, `DiskFailureDetector`, `KafkaBrokerFailureDetector` etc. are used for the anomaly detection, which runs periodically to check if the cluster have their corresponding anomalies or not (This periodic check can be easily configured through the `anomaly.detection.interval.ms` configuration).
Member

Suggested change
Various detector classes like `GoalViolationDetector`, `DiskFailureDetector`, `KafkaBrokerFailureDetector` etc. are used for the anomaly detection, which runs periodically to check if the cluster have their corresponding anomalies or not (This periodic check can be easily configured through the `anomaly.detection.interval.ms` configuration).
Various detector classes like `GoalViolationDetector`, `DiskFailureDetector`, `KafkaBrokerFailureDetector` etc. are used for the anomaly detection, which runs periodically to check if the cluster have their corresponding anomalies or not (The frequency of this check can be easily configured through the `anomaly.detection.interval.ms` configuration).

Various detector classes like `GoalViolationDetector`, `DiskFailureDetector`, `KafkaBrokerFailureDetector` etc. are used for the anomaly detection, which runs periodically to check if the cluster have their corresponding anomalies or not (This periodic check can be easily configured through the `anomaly.detection.interval.ms` configuration).
Every detector class works in a different way to detect their corresponding anomalies. A `KafkaBrokerFailureDetector` will require a deep usage of requesting metadata to the Kafka cluster (i.e. broker failure) while if we talk about `DiskFailureDetector` it will be more inclined towards using the admin client API. In the same way the `MetricAnomalyDetector` will be making use of metrics and other detector like `GoalViolationDetector` and `TopicAnomalyDetector` will have their own way to detect anomalies.
The detected anomalies can be of various types:
* Goal Violation - This happens if certain optimization goals are violated (e.g. DiskUsageDistributionGoal etc.). These goals can be configured independently (through the `self.healing.goals` config) to those used for manual rebalancing..
Member

Suggested change
* Goal Violation - This happens if certain optimization goals are violated (e.g. DiskUsageDistributionGoal etc.). These goals can be configured independently (through the `self.healing.goals` config) to those used for manual rebalancing..
* Goal Violation - This happens if certain optimization goals are violated (e.g. DiskUsageDistributionGoal etc.). These goals can be configured independently (through the `self.healing.goals` config) from those used for manual rebalancing.

The nuances between the goal configurations of Cruise Control can be quite confusing; to help keep them straight in my head I categorize them in the following way:

  • The server-side Cruise Control goal configs in spec.cruiseControl.config of the Kafka resource
  • The client-side Cruise Control goal config in spec of the KafkaRebalance resource

To avoid confusion, it may be worth adding some text that explains where and how self.healing.goals compares/relates to the other server-side Cruise Control goal configs - goals, default.goals, hard.goals, anomaly.detection.goals. It wouldn't have to be right here, but somewhere in the proposal where the self.healing configuration is raised. When we add the self-healing feature we will need to add some documentation of how self.healing.goals relates to the other goal configurations anyway. We could use the same language used to describe the different goal configs in the Strimzi documentation and provide a link to those docs as well.

I think the language you and Paolo use in this thread referring to other goal configs like "superset" or subset is helpful. Since the goal configs are so complicated, I don't think it is a problem if we reuse text from the Strimzi Cruise Control goal documentation to emphasize the distinction of the self.healing.goals config.
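For illustration, the server-side goal configurations might sit together like this in the `Kafka` resource (a sketch; the goal lists shown are example values, not recommendations):

```yaml
spec:
  cruiseControl:
    config:
      # goals used when generating manual rebalance proposals
      default.goals: >
        com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkInboundCapacityGoal,
        com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskCapacityGoal
      # goals whose violation the anomaly detector watches for
      anomaly.detection.goals: >
        com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskCapacityGoal
      # goals used when self-healing computes its fix
      self.healing.goals: >
        com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskCapacityGoal
      self.healing.enabled: true
```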

* Metric anomaly - This failure happens if metrics collected by Cruise Control have some anomaly in their value (e.g. a sudden rise in the log flush time metrics).

The detected anomalies are inserted into a priority queue and the anomaly with the highest priority is picked first to resolve.
The anomaly detector manager calls the notifier to get an action regarding, whether the anomaly should be fixed, delayed or ignored.
Member

Suggested change
The anomaly detector manager calls the notifier to get an action regarding, whether the anomaly should be fixed, delayed or ignored.
The anomaly detector manager calls the notifier to get an action regarding whether the anomaly should be fixed, delayed, or ignored.


The detected anomalies are inserted into a priority queue and the anomaly with the highest priority is picked first to resolve.
The anomaly detector manager calls the notifier to get an action regarding, whether the anomaly should be fixed, delayed or ignored.
If the action is fix, then the anomaly detector manager calls the classes that are required to resolve the anomaly.
Member

Suggested change
If the action is fix, then the anomaly detector manager calls the classes that are required to resolve the anomaly.
If the action is fixed, then the anomaly detector manager calls the classes that are required to resolve the anomaly.
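A simplified sketch of that detect → notify → fix flow (illustrative only; the real Cruise Control types such as `AnomalyDetectorManager` have much richer interfaces than this):

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Illustrative model of the loop described above. Type and method names
// are simplified stand-ins, not Cruise Control's actual API.
public class AnomalyFlowSketch {

    public enum Action { FIX, DELAY, IGNORE }

    public record Anomaly(String type, int priority) { }

    // Anomalies are ordered so the highest-priority one is polled first.
    public static final PriorityQueue<Anomaly> QUEUE =
            new PriorityQueue<>(Comparator.comparingInt(Anomaly::priority).reversed());

    // Stand-in for the notifier decision; a real notifier would consult
    // its configuration (and, per this proposal, the pause annotation).
    public static Action askNotifier(Anomaly anomaly) {
        return Action.FIX;
    }

    public static void main(String[] args) {
        QUEUE.add(new Anomaly("GOAL_VIOLATION", 1));
        QUEUE.add(new Anomaly("BROKER_FAILURE", 10));

        while (!QUEUE.isEmpty()) {
            Anomaly next = QUEUE.poll();          // highest priority first
            if (askNotifier(next) == Action.FIX) {
                System.out.println("fixing " + next.type());
            }
        }
    }
}
```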



### Enabling self-healing in the Kafka custom resource

When the feature gate is enabled, the operator will set the `self.healing.enabled` Cruise Control configuration to `true` and allow the users to set other appropriate `self-healing` related [configurations](https://github.com/linkedin/cruise-control/wiki/Configurations#selfhealingnotifier-configurations).
Member

Suggested change
When the feature gate is enabled, the operator will set the `self.healing.enabled` Cruise Control configuration to `true` and allow the users to set other appropriate `self-healing` related [configurations](https://github.com/linkedin/cruise-control/wiki/Configurations#selfhealingnotifier-configurations).
When the `UseSelfHealing` feature gate is enabled, users will be able to activate self-healing by configuring the [`self.healing` Cruise Control configurations](https://github.com/linkedin/cruise-control/wiki/Configurations#selfhealingnotifier-configurations) in the `spec.cruiseControl.config` of their `Kafka` resource.
When the `UseSelfHealing` feature gate is disabled, users will not be able to activate self healing and all `self.healing` Cruise Control configurations will be ignored.

self.healing.enabled: true
```

Where the user wants to enable self-healing for some specific anomalies, they can use configurations like `self.healing.broker.failure.enabled` for broker failure, `self.healing.goal.violation.enabled` for goal violation etc.
Member

If the user doesn't specify any specific anomalies (i.e. uses just self.healing.enabled: true), then self-healing would detect all the anomalies.

I wonder if it would be worth having all anomaly detection disabled by default and force users to explicitly enable which specific anomalies they want to detect and have fixed automatically. I imagine the alerting of some anomalies might be pretty noisy and the automatic fixes may be quite taxing on the cluster, having all anomalies enabled at once by default might surprise and overload users not familiar with the feature.
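As a sketch, enabling self-healing only for selected anomaly types might look like this (the keys come from the Cruise Control configuration docs; the values are examples):

```yaml
spec:
  cruiseControl:
    config:
      self.healing.enabled: false                # no blanket enablement
      self.healing.broker.failure.enabled: true  # heal broker failures only
      self.healing.disk.failure.enabled: true    # ...and disk failures
      self.healing.goal.violation.enabled: false
```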


Cruise Control provides `AnomalyNotifier` interface which has multiple abstract methods on what to do if certain anomalies are detected.
`SelfHealingNotifier` is the base class that contains the logic of self-healing and override the methods in by implementing the `AnomalyNotifier` interface.
Some of those methods are:`onGoalViolation()`, `onBrokerFailure()`, `onDisk Failure`, `alert()` etc.
Member

Suggested change
Some of those methods are:`onGoalViolation()`, `onBrokerFailure()`, `onDisk Failure`, `alert()` etc.
Some of those methods are:`onGoalViolation()`, `onBrokerFailure()`, `onDiskFailure()`, `alert()` etc.

`SelfHealingNotifier` is the base class that contains the logic of self-healing and override the methods in by implementing the `AnomalyNotifier` interface.
Some of those methods are:`onGoalViolation()`, `onBrokerFailure()`, `onDisk Failure`, `alert()` etc.
The `SelfHealingNotifier` class can then be further used to implement your own notifier
The other custom notifier offered by Cruise control are also based upon the `SelfHealingNotifier`.
Member

Suggested change
The other custom notifier offered by Cruise control are also based upon the `SelfHealingNotifier`.
The other custom notifiers offered by Cruise control are also based upon the `SelfHealingNotifier`.


### Strimzi Operations while self-healing is running/fixing an anomaly

Self-healing is a very robust and resilient feature due to its ability to tackle issues which can appear while the healing of cluster is happening.
Member

Suggested change
Self-healing is a very robust and resilient feature due to its ability to tackle issues which can appear while the healing of cluster is happening.
Self-healing is a very robust and resilient feature due to its ability to tackle issues while healing a cluster.

Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>

#### What happens when self-healing is running/fixing an anomaly and brokers are rolled

If a broker fails, self-healing will trigger and Cruise Control will try to move the partition replicas from the failed broker to other healthy brokers in the Kafka cluster.
If, during the process of moving partitions from the failed broker, another broker in the Kafka cluster gets deleted or rolled, then Cruise Control would finish the transfer from the failed broker but log that `self-healing finished successful with error`.
If some anomaly is getting healed , Cruise Control will then try to fix/mitigate the anomaly by moving the partition replicas across the brokers in the Kafka cluster.
Contributor

When an anomaly is detected, Cruise Control may try to fix/mitigate it by moving the partition replicas across the brokers in the Kafka cluster.

The detected anomalies can be of various types:
* Goal Violation - This happens if certain optimization goals are violated (e.g. DiskUsageDistributionGoal etc.). These goals can be configured independently (through the `self.healing.goals` config) to those used for manual rebalancing..
* Goal Violation - This happens if certain optimization goals are violated (e.g. DiskUsageDistributionGoal etc.). These goals can be configured independently (through the `self.healing.goals` configuration under the `spec.cruiseControl.config` section) to those used for manual rebalancing(goals configured in the KafkaRebalance custom resource).
Contributor

I think this part confuses me slightly; maybe it can be reworded better.
Is the below what you mean?

  • Goal Violation - This happens if certain optimization goals such as DiskUsageDistributionGoal are violated. The optimization goals can be configured independently from KafkaRebalance custom resource by using spec.cruiseControl.config.self.healing.goals in Kafka custom resource.

When you say certain optimization goals, are you saying only some would trigger Goal Violation anomaly?


Yes, AFAIK it would only be those goals specifically listed in the self healing goals configuration.

Contributor Author

I tried to make the sentence less confusing, I hope its better now

@@ -76,17 +79,18 @@ statusUpdateDate=2024-12-09T09:49:47Z,
failedDisksByTimeMs={2={/var/lib/kafka/data-1/kafka-log2=1733737786603}},
status=FIX_STARTED}]}
```
where the `anomalyID` represents the task id assigned to the anomaly, `failedDisksByTimeMs` represents the disk which failed, and `detectionDate` and `statusUpdateDate` represent the date and time when the anomaly was detected and when the status of the anomaly was updated.
Contributor

I don't see the mentioned anomalyType field snippet. Is this a full snippet of an anomaly?

I guess it's not that important, since this is just an example, but should we explain the value of failedDisksByTimeMs? In {2={/var/lib/kafka/data-1/kafka-log2=1733737786603}}, what do these values mean and how do they represent the failed disk? If you think this is not important, please feel free to ignore this part.

Contributor Author

I have added some explanation around it

* COMPLETENESS_NOT_READY - Completeness not met for some goals. The completeness describes the confidence level of the data in the metric sample aggregator. If this is low then some Goal classes may refuse to produce proposals.

We can also poll the `state` endpoint, with `substate` parameter as `executor`, to get details on the tasks that are currently being performed by Cruise Control.
We can again poll the [`state` endpoint](https://github.com/linkedin/cruise-control/wiki/REST-APIs#query-the-state-of-cruise-control), with `substate` parameter as `executor`, to get details on the tasks that are currently being performed by Cruise Control.
Contributor

Will we document how users might poll this endpoint?

Contributor Author

By this, do you mean mentioning the complete curl command?

Contributor Author

I have added the curl command to poll the endpoint

@@ -177,13 +183,13 @@ The event published would look like this:
type: Normal
```
Contributor

#152 - would this proposal have an impact on the format of the Kubernetes event for consistency?

Contributor Author

Not sure, the notifier would reside in a separate repo under Strimzi, but I can adapt the events to look like this in the proposal. Wdyt @ppatierno?

Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>