
Proposal for adding Self healing feature in the operator #145

Open · wants to merge 3 commits into base: main

Conversation

ShubhamRwt (Contributor):

This proposal shows how we plan to implement the self-healing feature in the operator.

Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>
@scholzj (Member) left a comment:

Thanks for the proposal.

I think it would be useful to use regular human language here rather than Cruise Control terminology, as that would give a better perspective on what exactly is and is not being proposed here (and what self-healing actually does and does not do).

I do not understand the role of the StrimziNotifier. While I doubt that anyone will find it appealing to have Kubernetes events used for the notification instead of Slack or Prometheus, I cannot of course rule out that this will find its users. But it seems to be only marginally related to Strimzi. I'm sure we can host this under the Strimzi GitHub organization if desired, but it seems more like an unrelated subproject that is separate from the self-healing feature itself.

The actual self-healing ... you focus a lot on the disk and broker failures. But it seems to me like the self-healing does not offer good solutions for these. The idea to solve this by rescheduling the partition replicas to other brokers seems well suited for Kafka on bare-metal with local storage, where a dead server often means that you have to get a brand new server to replace it. It does not seem well suited to self-healing Kubernetes infrastructure and cloud-based environments, where Pods can just be rescheduled and a new disk can be provisioned in a matter of seconds.

So it seems to me like the self-healing itself as done by Cruise Control could be useful for dealing with imbalanced brokers. But for situations such as broker or disk failures or insufficient capacity, we should be more ambitious and aim to solve them properly in the Strimzi Cluster Operator. That is where an actual StrimziNotifier might play a future role: not issuing Kubernetes Events to tell the user about a broken disk, but providing a mechanism to tell Strimzi about it so that Strimzi can react to it, e.g. by scaling for capacity based on a recommendation from CC or, as suggested, deleting the broken PVC and getting a new one provisioned to replace it. While I understand that this might be out of scope for this proposal, I think it is important to leave space for it, for example by focusing here on the auto-rebalancing features and leaving the other detected issues unused for future development.

## Current situation

During normal operations it's common for Kafka clusters to become unbalanced (through partition key skew, uneven partition placement etc.) or for problems to occur with disks or other underlying hardware which makes the cluster unhealthy.
Currently, if we encounter any such scenario we need to fix these issues manually, e.g. if there is a broker failure we might move the partition replicas from that corrupted broker to some other healthy broker by using the `KafkaRebalance` custom resource in the [`remove-broker`](https://strimzi.io/docs/operators/latest/full/deploying.html#proc-generating-optimization-proposals-str) mode.
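As an illustration, such a manual fix might look like the following `KafkaRebalance` resource (a sketch based on the Strimzi `KafkaRebalance` API; the resource name, cluster name, and broker ID are hypothetical):

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaRebalance
metadata:
  name: move-replicas-off-broker-3     # hypothetical name
  labels:
    strimzi.io/cluster: my-cluster     # hypothetical cluster name
spec:
  mode: remove-brokers                 # mode value as used by the KafkaRebalance API
  brokers: [3]                         # broker whose partition replicas should be moved away
```

After the proposal is approved, Cruise Control moves the replicas from broker 3 to the remaining healthy brokers.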
Member:

This is not some bare-metal cluster but Kubernetes. So I'm struggling to find the issues where you would need to do what you suggest. Instead, on Kubernetes, you would normally just fix the issue on the existing broker.

Member:

I can see your point. At the Kubernetes level, maybe a broker failure is more related to a lack of resources, which you should fix at the cluster level rather than by removing the broker. A better example could be a disk failure on a broker or some goal violation (which is the usual reason why we use manual rebalancing).


With smaller clusters, it is feasible to fix things manually. However, for larger ones it can be very time-consuming, or just not feasible, to fix all the issues on your own.
Member:

I think you should clarify what issues you are talking about here. From the past discussion, the terminology used here seems highly confusing ... and I suspect that words such as issues, anomalies, and problems are ambiguous at best.

Member:

For me they are all synonyms more or less. Maybe just use one and stick with it?


## Motivation

With the help of the self-healing feature we can resolve issues like disk failure and broker failure automatically.
Member:

Nothing in this proposal suggests that Cruise Control can solve disk or broker failures.

Member:

AFAICS, the proposal introduces the concept of broker/disk failure detectors and related fix class which highlight that CC can solve them. Or what are you looking for?


Cruise Control treats these issues as "anomalies", the detection of which is the responsibility of the anomaly detector mechanism.
Currently, users are [able to set](https://strimzi.io/docs/operators/latest/full/deploying.html#setting_up_alerts_for_anomaly_detection) the notifier to one of those included with Cruise Control (<list of included notifiers>).
Member:

The link seems broken? (Double ( and )?)

The anomaly detector manager helps in detecting the anomalies and also handling the detected anomalies.
It acts as a coordinator between the detector classes and the classes that handle and resolve the anomalies.
Various detector classes like `GoalViolationDetector`, `DiskFailureDetector`, `KafkaBrokerFailureDetector` etc. are used for anomaly detection; they run periodically to check whether the cluster has their corresponding anomalies (the period can be configured through the `anomaly.detection.interval.ms` configuration).
Every detector class works in a different way to detect their corresponding anomalies. A `KafkaBrokerFailureDetector` will require a deep usage of requesting metadata to the Kafka cluster (i.e. broker failure) while if we talk about `DiskFailureDetector` it will be more inclined towards using the admin client API. In the same way the `MetricAnomalyDetector` will be making use of metrics and other detector like `GoalViolationDetector` and `TopicAnomalyDetector` will have their own way to detect anomalies.
Member:

Not sure what exactly it is trying to say ... but it does not seem to make sense to me.

Contributor:

I'm not sure if we need this level of detail. Maybe just link to the docs of these detectors? If not, maybe reword the sentence to something like this:

Suggested change
Every detector class works in a different way to detect their corresponding anomalies. A `KafkaBrokerFailureDetector` will require a deep usage of requesting metadata to the Kafka cluster (i.e. broker failure) while if we talk about `DiskFailureDetector` it will be more inclined towards using the admin client API. In the same way the `MetricAnomalyDetector` will be making use of metrics and other detector like `GoalViolationDetector` and `TopicAnomalyDetector` will have their own way to detect anomalies.
Detector classes have different mechanisms to detect their corresponding anomalies. For example, `KafkaBrokerFailureDetector` utilises Kafka Metadata API whereas `DiskFailureDetector` utilises Kafka Admin API. Furthermore, `MetricAnomalyDetector` uses metrics while other detector such as `GoalViolationDetector` and `TopicAnomalyDetector` use a different mechanism to detect anomalies.

Maybe we should briefly mention how GoalViolationDetector and TopicAnomalyDetector detect anomalies, as we do for the others.

Member:

> Not sure what exactly it is trying to say ... but it does not seem to make sense to me.

What the text is trying to say is that each detector has its own way to detect the anomaly (i.e. using the admin client API, or a deep metadata API request, or reading metrics and so on). Maybe the changes proposed by @tinaselenge make it clearer?


The `StrimziNotifier` class is going to reside in a separate module in the operator repository called `strimzi-cruise-control-notifier`, so that a separate JAR for the notifier can be built for inclusion in the Cruise Control container image.
We will also need to include the fabric8 client dependencies in the notifier JAR that will be used for generating the Kubernetes events as well as for reading the annotation on the Kafka resource from the notifier.
Member:

Reading what annotation?

Member:

Yeah, I think you are anticipating what's going to be in the next paragraph but this sentence doesn't make any sense here.

The user should also have the option to pause the self-healing feature, in case they know or plan to do some rolling restart of the cluster or some rebalance they are going to perform which can cause issue with the self-healing itself.
For this, we will be having an annotation called `strimzi.io/self-healing-paused`.
We can set this annotation on the Kafka custom resource to pause the self-healing feature.
Self-healing is not paused by default, so a missing annotation means that detected anomalies will keep being healed automatically.
When self-healing is paused, all anomalies detected after the pause annotation is applied are ignored and no fix is attempted for them.
Anomalies that were already being fixed continue with the fix.
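For illustration, pausing could then look like this on the `Kafka` custom resource (a sketch; the cluster name is hypothetical, the annotation name is the one proposed above):

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster                           # hypothetical cluster name
  annotations:
    strimzi.io/self-healing-paused: "true"   # proposed pause annotation
spec:
  # ... rest of the Kafka spec unchanged ...
```

Removing the annotation (or not setting it at all) would leave self-healing active.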
Member:

Why is pausing the self-healing useful for the users? Assuming it is useful, why not just disable the self-healing?

Member:

Disabling self-healing would mean restarting CC. Pausing means that the notifier just returns "ignore" to the anomaly detector (but you could still see anomalies logged in the CC pod, for example).


#### What happens when self-healing is running/fixing an anomaly and brokers are rolled

If a broker fails, self-healing will trigger and Cruise Control will try to move the partition replicas from the failed broker to other healthy brokers in the Kafka cluster.
Member:

What exactly does it mean when the broker fails? How does CC recognize it? How does it distinguish failure for example from some temporary Kubernetes scheduling issue?

Member:

The broker anomaly detector uses a deep metadata request to the Kafka cluster to know the brokers' status. The broker failure in particular is not fixed straight away; the fix is first delayed in order to double-check "later" whether the broker is still offline. All the timings between double checks are configurable, so a temporary issue will not be reported as an anomaly or fixed.
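For reference, the double-check timings mentioned above correspond to Cruise Control configuration keys (a sketch based on the Cruise Control configuration documentation; the values shown are illustrative, not recommended defaults):

```yaml
# Illustrative Cruise Control settings (e.g. under spec.cruiseControl.config
# in the Kafka custom resource); values are examples only.
anomaly.detection.interval.ms: 300000              # how often the detectors run
broker.failure.alert.threshold.ms: 900000          # wait before alerting on a failed broker
broker.failure.self.healing.threshold.ms: 1800000  # wait before self-healing acts on it
```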

If, during the process of moving partitions from the failed broker, another broker in the Kafka cluster gets deleted or rolled, then Cruise Control would finish the transfer from the failed broker but log that `self-healing finished successful with error`.
Member:

How would it finish the transfer from the failed broker when the target broker is stopped? You mean that it will continue with it after the target broker is rolled?

Contributor (Author):

Yes. If the target broker is rolling at that moment, then self-healing would end saying `self-healing finished successful with error`. The periodic anomaly check will then detect the failed broker again and try to fix it. If the target broker has been rolled by that time, self-healing will start again; if the rolled broker is not back up, we will get two anomalies, one to fix the failed broker and one for the target broker.


#### What happens when self-healing is running/fixing an anomaly and Kafka Rebalance is applied

If self-healing is ongoing and a `KafkaRebalance` resource is posted in the middle of it then the `KafkaRebalance` resource will move to `NotReady` state, stating that there is already a rebalance ongoing.
Member:

Does this need to differentiate between asking for a proposal and approving a proposal? For example, when asking for a proposal I would expect the operator to wait and give me a proposal after the self-healing ends.

Member:

From the following snippet I can see the dryrun=false, so this seems to be a condition happening when you approve a proposal. @ShubhamRwt does it mean that during self-healing, CC was able to return a proposal to you, you were able to approve it, and you then got the error?


#### Anomaly Detector Manager

The anomaly detector manager helps in detecting the anomalies and also handling the detected anomalies.
Contributor:

Suggested change
The anomaly detector manager helps in detecting the anomalies and also handling the detected anomalies.
The anomaly detector manager helps in detecting the anomalies as well as handling them.

The detected anomalies can be of various types:
* Goal Violation - This happens if certain optimization goals are violated (e.g. `DiskUsageDistributionGoal`). These goals can be configured independently (through the `self.healing.goals` config) of those used for manual rebalancing.
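A sketch of what that configuration might look like (the goal list here is purely illustrative; the class names follow the Cruise Control goals package):

```yaml
# Illustrative: goals checked by self-healing, independent of manual rebalancing goals.
self.healing.goals: com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskUsageDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaCapacityGoal
```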
Contributor:

So you mean the goals in `KafkaRebalance` and the `self.healing.goals` are independent of each other?

Contributor (Author):

If you have specified hard goals in the KR, then the `self.healing.goals` should be a superset of the hard goals.

Member:

> if you have specified hard goals in the KR, then the self.healing.goals should be a superset of the hard goals.

If by KR you mean the KafkaRebalance custom resource, I don't think what you are saying is true. There is no relationship between a specific KafkaRebalance and self-healing configuration.
Maybe are you referring to the general CC configuration? So the default hard goals in CC has to be a superset of self.healing.goals (and not the other way around)?

The anomaly detector manager calls the notifier to get an action regarding whether the anomaly should be fixed, delayed, or ignored.
If the action is fix, then the anomaly detector manager calls the classes that are required to resolve the anomaly.

Anomaly detection also has several other [configurations](https://github.com/linkedin/cruise-control/wiki/Configurations#anomalydetector-configurations), such as the detection interval and the anomaly notifier class, which can effect the performance of the Cruise Control server and the latency of the anomaly detection.
Contributor:

Suggested change
Anomaly detection also has several other [configurations](https://github.com/linkedin/cruise-control/wiki/Configurations#anomalydetector-configurations), such as the detection interval and the anomaly notifier class, which can effect the performance of the Cruise Control server and the latency of the anomaly detection.
Anomaly detection also has various [configurations](https://github.com/linkedin/cruise-control/wiki/Configurations#anomalydetector-configurations), such as the detection interval and the anomaly notifier class, which can affect the performance of the Cruise Control server and the latency of the anomaly detection.

Cruise Control also provides [custom notifiers](https://github.com/linkedin/cruise-control/wiki/Configure-notifications) like the Slack Notifier, Alerta Notifier etc. for notifying users about the anomalies. There are multiple other [self-healing notifier](https://github.com/linkedin/cruise-control/wiki/Configurations#selfhealingnotifier-configurations) related configurations you can use to tune notifiers for your use case.
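For example, switching from the default no-op notifier to one of the custom notifiers is done via the `anomaly.notifier.class` configuration (a sketch; the class name follows the Cruise Control notifier documentation):

```yaml
# Illustrative: select a non-default notifier so detected anomalies are acted upon.
anomaly.notifier.class: com.linkedin.kafka.cruisecontrol.detector.notifier.SlackSelfHealingNotifier
```

The proposed `StrimziNotifier` would be plugged in through this same configuration key.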

#### Self Healing
If self-healing is enabled in Cruise Control and anomalies are detected, then a fix for these will be attempted automatically.
Contributor:

A fix will be attempted automatically only if `anomaly.notifier.class` is configured to a non-default notifier, right? So enabling self-healing alone is not enough, as with the default notifier (`NoopNotifier`), anomalies are detected but ignored?

Contributor (Author):

Yes, you will need to override the notifier too

The anomaly status would change based on how the healing progresses. The anomaly can transition to these possible statuses:
* DETECTED - The anomaly is just detected and no action is yet taken on it
* IGNORED - Self healing is either disabled or the anomaly is unfixable
* FIX_STARTED - The anomaly has started getting fixed
@tinaselenge (Contributor), Jan 14, 2025:

Suggested change
* FIX_STARTED - The anomaly has started getting fixed
* FIX_STARTED - The anomaly fix has started

* FIX_FAILED_TO_START - The proposal generation for the anomaly fix failed
* CHECK_WITH_DELAY - The anomaly fix is delayed
* LOAD_MONITOR_NOT_READY - Monitor which is monitoring the workload of a Kafka cluster is in loading or bootstrap state
Contributor:

Suggested change
* LOAD_MONITOR_NOT_READY - Monitor which is monitoring the workload of a Kafka cluster is in loading or bootstrap state
* LOAD_MONITOR_NOT_READY - Monitor for workload of a Kafka cluster is in loading or bootstrap state

```
self.healing.enabled: true
```

If the user wants to enable self-healing only for some specific anomalies, they can use configurations like `self.healing.broker.failure.enabled` for broker failure, `self.healing.goal.violation.enabled` for goal violation etc.
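Such a selective setup might look like this (a sketch; the per-anomaly keys follow the Cruise Control self-healing configuration naming, and the combination shown is only an example):

```yaml
# Illustrative: enable self-healing only for selected anomaly types.
self.healing.enabled: false                  # do not enable healing globally
self.healing.broker.failure.enabled: true    # heal broker failures
self.healing.goal.violation.enabled: true    # heal goal violations
```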
Contributor:

Are any anomalies enabled by default? If the user does not specify anomalies to detect, will self-healing do nothing?

@ShubhamRwt (Contributor, Author), Jan 15, 2025:

If the user doesn't specify any specific anomalies (i.e. you use just `self.healing.enabled: true`) then self-healing will detect all the anomalies. In case the user knows that some anomaly will keep appearing because they have a known issue in their cluster, they can skip the check for that particular anomaly.

Member:

> If the user doesn't specify any specific anomalies (i.e. you use just `self.healing.enabled: true`) then self-healing will detect all the anomalies.

I wonder if it would be worth having all anomaly detection disabled by default and forcing users to explicitly enable which specific anomalies they want detected and fixed automatically. I imagine the alerting for some anomalies might be pretty noisy and the automatic fixes may be quite taxing on the cluster; having all anomalies enabled at once by default might surprise and overload users not familiar with the feature.

The notifier class returns the `action` that is going to be taken on the notified anomaly.
These actions can be classified into three types:
* FIX - Start the anomaly fix
* CHECK - Delay the anomaly fix
Contributor:

Does the delay action apply a configured delay before taking the action provided by the notifier?

@ppatierno (Member) left a comment:

I left some comments. Can you go through the proposal and split paragraphs with one sentence per line? Thanks!

Member:

This is the first time you mention the term "notifier" but it shows up out of context. You should first explain what it is and why it's useful in general, not only in relation to self-healing but more to alerting.


### Pausing the self-healing feature

Member:
Suggested change
For this, we will be having an annotation called `strimzi.io/self-healing-paused`.
For this, we will be having an annotation called `strimzi.io/self-healing-paused` on the `Kafka` custom resource.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Disabling self-healing would mean restarting CC. Pausing means that the notifier just returns "ignore" to the anomaly detector (but you could still see anomalies logged in the CC pod, for example).

uid: 71928731-df71-4b9f-ac77-68961ac25a2f
```

We will have checks on methods like `onGoalViolation`, `onBrokerFailure`, `onTopicAnomaly` etc. in the `StrimziNotifier` to check if the `pause` annotation is applied on the Kafka custom resource or not.
Member

there is no pause annotation. Use the exact name.
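A hypothetical sketch of such a check (class and method names are illustrative stand-ins, not the actual `StrimziNotifier` implementation):

```java
import java.util.Map;

// Illustrative sketch of the annotation check the proposal describes:
// each on* method in the notifier would consult this before acting.
public class PauseCheckSketch {

    public static final String PAUSE_ANNOTATION = "strimzi.io/self-healing-paused";

    // Given the annotations currently on the Kafka custom resource,
    // decide whether a detected anomaly should be ignored.
    public static boolean shouldIgnore(Map<String, String> kafkaAnnotations) {
        return "true".equalsIgnoreCase(
                kafkaAnnotations.getOrDefault(PAUSE_ANNOTATION, "false"));
    }

    public static void main(String[] args) {
        // paused cluster: anomalies are ignored
        System.out.println(shouldIgnore(Map.of(PAUSE_ANNOTATION, "true")));
        // missing annotation: anomalies keep getting healed
        System.out.println(shouldIgnore(Map.of()));
    }
}
```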


#### What happens when self-healing is running/fixing an anomaly and brokers are rolled

If a broker fails, self-healing will trigger and Cruise Control will try to move the partition replicas from the failed broker to other healthy brokers in the Kafka cluster.
Member

The broker anomaly detector uses a deep metadata request to the Kafka cluster to know the brokers' status. A broker failure in particular is not fixed straight away but is first delayed in order to double-check "later" whether the broker is still offline. All the timings between double checks are configurable. So any temporary issue will not be reported as an anomaly and no fix will be attempted.


#### What happens when self-healing is running/fixing an anomaly and Kafka Rebalance is applied

If self-healing is ongoing and a `KafkaRebalance` resource is posted in the middle of it then the `KafkaRebalance` resource will move to `NotReady` state, stating that there is already a rebalance ongoing.
Member

From the following snippet I can see dryrun=false, so this seems to be a condition happening when you approve a proposal. @ShubhamRwt does it mean that during self-healing, CC was able to return a proposal to you, you approved it, and you then got the error?


The anomaly detector manager helps in detecting the anomalies and also handling the detected anomalies.
It acts as a coordinator between the detector classes as well as the classes which will be handling and resolving the anomalies.
Various detector classes like `GoalViolationDetector`, `DiskFailureDetector`, `KafkaBrokerFailureDetector` etc. are used for the anomaly detection, which runs periodically to check if the cluster have their corresponding anomalies or not (This periodic check can be easily configured through the `anomaly.detection.interval.ms` configuration).
Member

Suggested change
Various detector classes like `GoalViolationDetector`, `DiskFailureDetector`, `KafkaBrokerFailureDetector` etc. are used for the anomaly detection, which runs periodically to check if the cluster have their corresponding anomalies or not (This periodic check can be easily configured through the `anomaly.detection.interval.ms` configuration).
Various detector classes like `GoalViolationDetector`, `DiskFailureDetector`, `KafkaBrokerFailureDetector` etc. are used for the anomaly detection, which runs periodically to check if the cluster have their corresponding anomalies or not (The frequency of this check can be easily configured through the `anomaly.detection.interval.ms` configuration).

Various detector classes like `GoalViolationDetector`, `DiskFailureDetector`, `KafkaBrokerFailureDetector` etc. are used for the anomaly detection, which runs periodically to check if the cluster have their corresponding anomalies or not (This periodic check can be easily configured through the `anomaly.detection.interval.ms` configuration).
Every detector class works in a different way to detect their corresponding anomalies. A `KafkaBrokerFailureDetector` will require a deep usage of requesting metadata to the Kafka cluster (i.e. broker failure) while if we talk about `DiskFailureDetector` it will be more inclined towards using the admin client API. In the same way the `MetricAnomalyDetector` will be making use of metrics and other detector like `GoalViolationDetector` and `TopicAnomalyDetector` will have their own way to detect anomalies.
The detected anomalies can be of various types:
* Goal Violation - This happens if certain optimization goals are violated (e.g. DiskUsageDistributionGoal etc.). These goals can be configured independently (through the `self.healing.goals` config) to those used for manual rebalancing..
Member

Suggested change
* Goal Violation - This happens if certain optimization goals are violated (e.g. DiskUsageDistributionGoal etc.). These goals can be configured independently (through the `self.healing.goals` config) to those used for manual rebalancing..
* Goal Violation - This happens if certain optimization goals are violated (e.g. DiskUsageDistributionGoal etc.). These goals can be configured independently (through the `self.healing.goals` config) from those used for manual rebalancing.

The nuances between the goal configurations of Cruise Control can be quite confusing; to help keep them straight in my head I categorize them in the following way:

  • The server-side Cruise Control goal configs in spec.cruiseControl.config of the Kafka resource
  • The client-side Cruise Control goal config in spec of the KafkaRebalance resource

To avoid confusion, it may be worth adding some text that explains where and how self.healing.goals compares/relates to the other server-side Cruise Control goal configs - goals, default.goals, hard.goals, anomaly.detection.goals. It wouldn't have to be right here, but somewhere in the proposal where the self.healing configuration is raised. When we add the self-healing feature we will need to add some documentation of how self.healing.goals relates to the other goal configurations anyway. We could use the same language used to describe the different goal configs in the Strimzi documentation and provide a link to those docs as well.

I think the language you and Paolo use in this thread referring to other goal configs like "superset" or subset is helpful. Since the goal configs are so complicated, I don't think it is a problem if we reuse text from the Strimzi Cruise Control goal documentation to emphasize the distinction of the self.healing.goals config.
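For illustration, the server-side goal configurations might sit together like this in the `Kafka` resource (a sketch; the goal lists shown are example values, not recommendations):

```yaml
spec:
  cruiseControl:
    config:
      # goals used when generating manual rebalance proposals
      default.goals: >
        com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkInboundCapacityGoal,
        com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskCapacityGoal
      # goals whose violation the anomaly detector watches for
      anomaly.detection.goals: >
        com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskCapacityGoal
      # goals used when self-healing computes its fix
      self.healing.goals: >
        com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskCapacityGoal
      self.healing.enabled: true
```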

* Metric anomaly - This failure happens if metrics collected by Cruise Control have some anomaly in their value (e.g. a sudden rise in the log flush time metrics).

The detected anomalies are inserted into a priority queue and the anomaly with the highest priority is picked first to resolve.
The anomaly detector manager calls the notifier to get an action regarding, whether the anomaly should be fixed, delayed or ignored.
Member

Suggested change
The anomaly detector manager calls the notifier to get an action regarding, whether the anomaly should be fixed, delayed or ignored.
The anomaly detector manager calls the notifier to get an action regarding whether the anomaly should be fixed, delayed, or ignored.


The detected anomalies are inserted into a priority queue and the anomaly with the highest priority is picked first to resolve.
The anomaly detector manager calls the notifier to get an action regarding, whether the anomaly should be fixed, delayed or ignored.
If the action is fix, then the anomaly detector manager calls the classes that are required to resolve the anomaly.
Member

Suggested change
If the action is fix, then the anomaly detector manager calls the classes that are required to resolve the anomaly.
If the action is fixed, then the anomaly detector manager calls the classes that are required to resolve the anomaly.
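A simplified sketch of that detect → notify → fix flow (illustrative only; the real Cruise Control types such as `AnomalyDetectorManager` have much richer interfaces than this):

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Illustrative model of the loop described above. Type and method names
// are simplified stand-ins, not Cruise Control's actual API.
public class AnomalyFlowSketch {

    public enum Action { FIX, DELAY, IGNORE }

    public record Anomaly(String type, int priority) { }

    // Anomalies are ordered so the highest-priority one is polled first.
    public static final PriorityQueue<Anomaly> QUEUE =
            new PriorityQueue<>(Comparator.comparingInt(Anomaly::priority).reversed());

    // Stand-in for the notifier decision; a real notifier would consult
    // its configuration (and, per this proposal, the pause annotation).
    public static Action askNotifier(Anomaly anomaly) {
        return Action.FIX;
    }

    public static void main(String[] args) {
        QUEUE.add(new Anomaly("GOAL_VIOLATION", 1));
        QUEUE.add(new Anomaly("BROKER_FAILURE", 10));

        while (!QUEUE.isEmpty()) {
            Anomaly next = QUEUE.poll();          // highest priority first
            if (askNotifier(next) == Action.FIX) {
                System.out.println("fixing " + next.type());
            }
        }
    }
}
```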



### Enabling self-healing in the Kafka custom resource

When the feature gate is enabled, the operator will set the `self.healing.enabled` Cruise Control configuration to `true` and allow the users to set other appropriate `self-healing` related [configurations](https://github.com/linkedin/cruise-control/wiki/Configurations#selfhealingnotifier-configurations).
Member

Suggested change
When the feature gate is enabled, the operator will set the `self.healing.enabled` Cruise Control configuration to `true` and allow the users to set other appropriate `self-healing` related [configurations](https://github.com/linkedin/cruise-control/wiki/Configurations#selfhealingnotifier-configurations).
When the `UseSelfHealing` feature gate is enabled, users will be able to activate self-healing by configuring the [`self.healing` Cruise Control configurations](https://github.com/linkedin/cruise-control/wiki/Configurations#selfhealingnotifier-configurations) in the `spec.cruiseControl.config` of their `Kafka` resource.
When the `UseSelfHealing` feature gate is disabled, users will not be able to activate self healing and all `self.healing` Cruise Control configurations will be ignored.

self.healing.enabled: true
```

Where the user wants to enable self-healing for some specific anomalies, they can use configurations like `self.healing.broker.failure.enabled` for broker failure, `self.healing.goal.violation.enabled` for goal violation etc.
Member

If the user doesn't specify any specific anomalies (i.e. uses just self.healing.enabled: true), then self-healing would detect all the anomalies.

I wonder if it would be worth having all anomaly detection disabled by default and force users to explicitly enable which specific anomalies they want to detect and have fixed automatically. I imagine the alerting of some anomalies might be pretty noisy and the automatic fixes may be quite taxing on the cluster, having all anomalies enabled at once by default might surprise and overload users not familiar with the feature.
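As a sketch, enabling self-healing only for selected anomaly types might look like this (the keys come from the Cruise Control configuration docs; the values are examples):

```yaml
spec:
  cruiseControl:
    config:
      self.healing.enabled: false                # no blanket enablement
      self.healing.broker.failure.enabled: true  # heal broker failures only
      self.healing.disk.failure.enabled: true    # ...and disk failures
      self.healing.goal.violation.enabled: false
```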


Cruise Control provides `AnomalyNotifier` interface which has multiple abstract methods on what to do if certain anomalies are detected.
`SelfHealingNotifier` is the base class that contains the logic of self-healing and override the methods in by implementing the `AnomalyNotifier` interface.
Some of those methods are:`onGoalViolation()`, `onBrokerFailure()`, `onDisk Failure`, `alert()` etc.
Member

Suggested change
Some of those methods are:`onGoalViolation()`, `onBrokerFailure()`, `onDisk Failure`, `alert()` etc.
Some of those methods are:`onGoalViolation()`, `onBrokerFailure()`, `onDiskFailure()`, `alert()` etc.

`SelfHealingNotifier` is the base class that contains the logic of self-healing and override the methods in by implementing the `AnomalyNotifier` interface.
Some of those methods are:`onGoalViolation()`, `onBrokerFailure()`, `onDisk Failure`, `alert()` etc.
The `SelfHealingNotifier` class can then be further used to implement your own notifier
The other custom notifier offered by Cruise control are also based upon the `SelfHealingNotifier`.
Member

Suggested change
The other custom notifier offered by Cruise control are also based upon the `SelfHealingNotifier`.
The other custom notifiers offered by Cruise control are also based upon the `SelfHealingNotifier`.


### Strimzi Operations while self-healing is running/fixing an anomaly

Self-healing is a very robust and resilient feature due to its ability to tackle issues which can appear while the healing of cluster is happening.
Member

Suggested change
Self-healing is a very robust and resilient feature due to its ability to tackle issues which can appear while the healing of cluster is happening.
Self-healing is a very robust and resilient feature due to its ability to tackle issues while healing a cluster.

Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>

#### What happens when self-healing is running/fixing an anomaly and brokers are rolled

If a broker fails, self-healing will trigger and Cruise Control will try to move the partition replicas from the failed broker to other healthy brokers in the Kafka cluster.
If, during the process of moving partitions from the failed broker, another broker in the Kafka cluster gets deleted or rolled, then Cruise Control would finish the transfer from the failed broker but log that `self-healing finished successful with error`.
If some anomaly is getting healed , Cruise Control will then try to fix/mitigate the anomaly by moving the partition replicas across the brokers in the Kafka cluster.
Contributor

When an anomaly is detected, Cruise Control may try to fix/mitigate it by moving the partition replicas across the brokers in the Kafka cluster.

The detected anomalies can be of various types:
* Goal Violation - This happens if certain optimization goals are violated (e.g. DiskUsageDistributionGoal etc.). These goals can be configured independently (through the `self.healing.goals` config) to those used for manual rebalancing..
* Goal Violation - This happens if certain optimization goals are violated (e.g. DiskUsageDistributionGoal etc.). These goals can be configured independently (through the `self.healing.goals` configuration under the `spec.cruiseControl.config` section) to those used for manual rebalancing(goals configured in the KafkaRebalance custom resource).
Contributor

I think this part confuses me slightly; maybe it can be reworded better.
Is the below what you mean?

  • Goal Violation - This happens if certain optimization goals such as DiskUsageDistributionGoal are violated. The optimization goals can be configured independently from KafkaRebalance custom resource by using spec.cruiseControl.config.self.healing.goals in Kafka custom resource.

When you say certain optimization goals, are you saying only some would trigger Goal Violation anomaly?


Yes, AFAIK it would only be those goals specifically listed in the self healing goals configuration.

Contributor Author

I tried to make the sentence less confusing, I hope its better now

@@ -76,17 +79,18 @@ statusUpdateDate=2024-12-09T09:49:47Z,
failedDisksByTimeMs={2={/var/lib/kafka/data-1/kafka-log2=1733737786603}},
status=FIX_STARTED}]}
```
where the `anomalyID` represents the task id assigned to the anomaly, `failedDisksByTimeMs` represents the disk which failed, and `detectionDate` and `statusUpdateDate` represent the date and time when the anomaly was detected and when the status of the anomaly was updated.
Contributor

I don't see the mentioned anomalyType field snippet. Is this a full snippet of an anomaly?

I guess it's not that important, since this is just an example, but should we explain the value of failedDisksByTimeMs? In {2={/var/lib/kafka/data-1/kafka-log2=1733737786603}}, what do these values mean and how do they represent the failed disk? If you think this is not important, please feel free to ignore this part.

Contributor Author

I have added some explanation around it

* COMPLETENESS_NOT_READY - Completeness not met for some goals. The completeness describes the confidence level of the data in the metric sample aggregator. If this is low then some Goal classes may refuse to produce proposals.

We can also poll the `state` endpoint, with `substate` parameter as `executor`, to get details on the tasks that are currently being performed by Cruise Control.
We can again poll the [`state` endpoint](https://github.com/linkedin/cruise-control/wiki/REST-APIs#query-the-state-of-cruise-control), with `substate` parameter as `executor`, to get details on the tasks that are currently being performed by Cruise Control.
Contributor

Will we document how users might poll this endpoint?

Contributor Author

By this, do you mean mentioning the complete curl command?

Contributor Author

I have added the curl command to poll the endpoint

@@ -177,13 +183,13 @@ The event published would look like this:
type: Normal
```
Contributor

#152 - would this proposal have an impact on the format of the Kubernetes event for consistency?

Contributor Author

Not sure, the notifier would reside in a separate repo under Strimzi, but I can adapt the events to look like this in the proposal. Wdyt @ppatierno?

Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>