Proposal for adding a self-healing feature in the operator #145
Conversation
Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>
Thanks for the proposal.
I think it would be useful to use regular human language here rather than Cruise Control terminology, as that would give a better perspective on what exactly is being proposed here and what is not (and what self-healing actually does and does not do).
I do not understand the role of the `StrimziNotifier`. While I doubt that anyone will find it appealing to have Kubernetes events used for the notification instead of Slack or Prometheus, I cannot of course rule out that this will find its users. But it seems to be only marginally related to Strimzi. I'm sure we can host this under the Strimzi GitHub organization if desired, but it seems more like an unrelated subproject that is separate from the self-healing feature itself.
The actual self-healing ... you focus a lot on the disk and broker failures. But it seems to me like the self-healing does not offer good solutions for these. The idea of solving them by rescheduling the partition replicas to other brokers seems well suited to Kafka on bare-metal with local storage, where a dead server often means that you have to get a brand new server to replace it. It does not seem well suited to self-healing Kubernetes infrastructure and cloud-based environments, where Pods can be just rescheduled and a new disk can be provisioned in a matter of seconds.
So it seems to me like the self-healing itself as done by Cruise Control could be useful for dealing with imbalanced brokers. But for situations such as broker or disk failures or insufficient capacity, we should be more ambitious and aim to solve them properly in the Strimzi Cluster Operator. That is where an actual `StrimziNotifier` might play a future role: not to issue Kubernetes Events to tell the user about a broken disk, but to provide a mechanism to tell Strimzi about it so that Strimzi can react to it - e.g. by scaling based on a capacity recommendation from CC or, as suggested, by deleting the broken PVC and getting a new one provisioned to replace it. While I understand that this might be out of scope for this proposal, I think it is important to leave space for it, for example by focusing here on the auto-rebalancing features and leaving the other detected issues unused for future development.
## Current situation | ||
|
||
During normal operations it's common for Kafka clusters to become unbalanced (through partition key skew, uneven partition placement etc.) or for problems to occur with disks or other underlying hardware which makes the cluster unhealthy. | ||
Currently, if we encounter any such scenario we need to fix these issues manually i.e. if there is some broker failure then we might move the partition replicas from that corrupted broker to some other healthy broker by using the `KafkaRebalance` custom resource in the [`remove-broker`](https://strimzi.io/docs/operators/latest/full/deploying.html#proc-generating-optimization-proposals-str) mode. |
This is not some bare-metal cluster but Kubernetes. So I'm struggling to see the situations where you would need to do what you suggest. Instead, on Kubernetes, you would normally just fix the issue on the existing broker.
I can see your point. At the Kubernetes level, maybe a broker failure is more related to a lack of resources, which you should fix at the cluster level rather than by removing the broker. A better example could be a disk failure on a broker or some goal violation (which is the usual reason why we use manual rebalancing).
|
||
During normal operations it's common for Kafka clusters to become unbalanced (through partition key skew, uneven partition placement etc.) or for problems to occur with disks or other underlying hardware which makes the cluster unhealthy. | ||
Currently, if we encounter any such scenario we need to fix these issues manually i.e. if there is some broker failure then we might move the partition replicas from that corrupted broker to some other healthy broker by using the `KafkaRebalance` custom resource in the [`remove-broker`](https://strimzi.io/docs/operators/latest/full/deploying.html#proc-generating-optimization-proposals-str) mode. | ||
With smaller clusters, it is feasible to fix things manually. However, for larger ones it can be very time-consuming, or just not feasible, to fix all the issue on your own. |
I think you should clarify what issues you are talking about here. From the past discussion, the terminology used here seems highly confusing ... and I suspect that words such as `issues`, `anomalies`, and `problems` are ambiguous at best.
For me they are all synonyms more or less. Maybe just use one and stick with it?
|
||
## Motivation | ||
|
||
With the help of the self-healing feature we can resolve the issues like disk failure, broker failure and other issues automatically. |
Nothing in this proposal suggests that Cruise Control can solve disk or broker failures.
AFAICS, the proposal introduces the concept of broker/disk failure detectors and the related fix classes, which suggests that CC can solve them. Or what are you looking for?
|
||
With the help of the self-healing feature we can resolve the issues like disk failure, broker failure and other issues automatically. | ||
Cruise Control treats these issues as "anomalies", the detection of which is the responsibility of the anomaly detector mechanism. | ||
Currently, users are [able to set]((https://strimzi.io/docs/operators/latest/full/deploying.html#setting_up_alerts_for_anomaly_detection)) the notifier to one of those included with Cruise Control (<list of included notifiers>). |
The link seems broken? (Double `(` and `)`?)
The anomaly detector manager helps in detecting the anomalies and also handling the detected anomalies. | ||
It acts as a coordinator between the detector classes as well as the classes which will be handling and resolving the anomalies. | ||
Various detector classes like `GoalViolationDetector`, `DiskFailureDetector`, `KafkaBrokerFailureDetector` etc. are used for the anomaly detection, which runs periodically to check if the cluster have their corresponding anomalies or not (This periodic check can be easily configured through the `anomaly.detection.interval.ms` configuration). | ||
Every detector class works in a different way to detect their corresponding anomalies. A `KafkaBrokerFailureDetector` will require a deep usage of requesting metadata to the Kafka cluster (i.e. broker failure) while if we talk about `DiskFailureDetector` it will be more inclined towards using the admin client API. In the same way the `MetricAnomalyDetector` will be making use of metrics and other detector like `GoalViolationDetector` and `TopicAnomalyDetector` will have their own way to detect anomalies. |
Not sure what exactly it is trying to say ... but it does not seem to make sense to me.
I'm not sure if we need this level of detail. Maybe just link to the docs of these detectors? If not, maybe reword the sentence to something like this:
Every detector class works in a different way to detect their corresponding anomalies. A `KafkaBrokerFailureDetector` will require a deep usage of requesting metadata to the Kafka cluster (i.e. broker failure) while if we talk about `DiskFailureDetector` it will be more inclined towards using the admin client API. In the same way the `MetricAnomalyDetector` will be making use of metrics and other detector like `GoalViolationDetector` and `TopicAnomalyDetector` will have their own way to detect anomalies. | |
Detector classes have different mechanisms to detect their corresponding anomalies. For example, `KafkaBrokerFailureDetector` utilises Kafka Metadata API whereas `DiskFailureDetector` utilises Kafka Admin API. Furthermore, `MetricAnomalyDetector` uses metrics while other detector such as `GoalViolationDetector` and `TopicAnomalyDetector` use a different mechanism to detect anomalies. |
Maybe we should mention briefly how `GoalViolationDetector` and `TopicAnomalyDetector` detect anomalies, like these others too.
> Not sure what exactly it is trying to say ... but it does not seem to make sense to me.
What the text is trying to say is that each detector has its own way to detect the anomaly (i.e. using the admin client API, or a deep metadata API request, or reading metrics and so on). Maybe the changes proposed by @tinaselenge make it clearer?
``` | ||
|
||
The `StrimziNotifier` class is going to reside in a separate module in the operator repository called the `strimzi-cruise-control-notifier` so that a seperate jar for the notifier can be built for inclusion in the Cruise Control container image. | ||
We will also need to include the fabric8 client dependencies in the notifier JAR that will be used for generating the Kubernetes events as well as for reading the annotation on the Kafka resource from the notifier. |
Reading what annotation?
Yeah, I think you are anticipating what's going to be in the next paragraph but this sentence doesn't make any sense here.
The user should also have the option to pause the self-healing feature, in case they know or plan to do some rolling restart of the cluster or some rebalance they are going to perform which can cause issue with the self-healing itself. | ||
For this, we will be having an annotation called `strimzi.io/self-healing-paused`. | ||
We can set this annotation on the Kafka custom resource to pause the self-healing feature. | ||
self-healing is not paused by default so missing annotation would mean that anomalies detected will keep getting healed automatically. | ||
When self-healing is paused, it would mean that all the anomalies that are detected after the pause annotation is applied would be ignored and no fix would be done for them. | ||
The anomalies that were already getting fixed would continue with the fix. |
Why is pausing the self-healing useful for the users? Assuming it is useful, why not just disable the self-healing?
Disabling self-healing would mean restarting CC. Pausing means that the notifier just returns "ignore" to the anomaly detector (but you could still see anomalies logged in the CC pod, for example).
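To make the pause mechanism concrete, here is a minimal sketch of the annotation on the `Kafka` custom resource (the API version and cluster name are illustrative; the annotation key is the one proposed above):

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
  annotations:
    # While this annotation is present, the notifier is expected to answer "ignore"
    # for newly detected anomalies; removing the annotation resumes self-healing.
    strimzi.io/self-healing-paused: "true"
spec:
  # ... existing Kafka and Cruise Control configuration ...
```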
|
||
#### What happens when self-healing is running/fixing an anomaly and brokers are rolled | ||
|
||
If a broker fails, self-healing will trigger and Cruise Control will try to move the partition replicas from the failed broker to other healthy brokers in the Kafka cluster. |
What exactly does it mean when the broker fails? How does CC recognize it? How does it distinguish a failure from, for example, some temporary Kubernetes scheduling issue?
The broker anomaly detector uses a deep metadata request to the Kafka cluster to know the brokers' status. A broker failure in particular is not fixed straight away; the fix is first delayed in order to double-check "later" whether the broker is still offline. All the timings between double-checks are configurable. So any temporary issue will not be reported as an anomaly, and no fix will be attempted for it.
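A rough sketch of how those delays might be tuned in `spec.cruiseControl.config`; the two threshold keys below are assumed from the Cruise Control `SelfHealingNotifier` configuration and should be double-checked against its documentation:

```yaml
spec:
  cruiseControl:
    config:
      self.healing.enabled: true
      # Assumed keys: how long a broker must stay offline before an alert is raised,
      # and before self-healing actually starts moving replicas off it.
      broker.failure.alert.threshold.ms: 900000          # 15 minutes
      broker.failure.self.healing.threshold.ms: 1800000  # 30 minutes
```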
#### What happens when self-healing is running/fixing an anomaly and brokers are rolled | ||
|
||
If a broker fails, self-healing will trigger and Cruise Control will try to move the partition replicas from the failed broker to other healthy brokers in the Kafka cluster. | ||
If, during the process of moving partitions from the failed broker, another broker in the Kafka cluster gets deleted or rolled, then Cruise Control would finish the transfer from the failed broker but log that `self-healing finished successful with error`. |
How would it finish the transfer from the failed broker when the target broker is stopped? You mean that it will continue with it after the target broker is rolled?
Yes. If the target broker is rolling at that moment, then self-healing would end saying `self-healing finished successful with error`. Then the periodic anomaly check will again detect the failed broker and try to fix it. If the target broker has been rolled by that time, then self-healing will start again; otherwise, if the rolled broker is not back up, we will get two anomalies to fix: the failed broker and the target broker.
|
||
#### What happens when self-healing is running/fixing an anomaly and Kafka Rebalance is applied | ||
|
||
If self-healing is ongoing and a `KafkaRebalance` resource is posted in the middle of it then the `KafkaRebalance` resource will move to `NotReady` state, stating that there is already a rebalance ongoing. |
Does this need to differentiate between asking for a proposal and approving a proposal? For example, when asking for a proposal, I would expect the operator to wait and give me a proposal after the self-healing ends.
From the following snippet I can see `dryrun=false`, so this seems to be a condition happening when you approve a proposal. @ShubhamRwt does it mean that during self-healing, CC was able to return a proposal to you, you were able to approve it, and you then got the error?
|
||
#### Anomaly Detector Manager | ||
|
||
The anomaly detector manager helps in detecting the anomalies and also handling the detected anomalies. |
The anomaly detector manager helps in detecting the anomalies and also handling the detected anomalies. | |
The anomaly detector manager helps in detecting the anomalies as well as handling them. |
Various detector classes like `GoalViolationDetector`, `DiskFailureDetector`, `KafkaBrokerFailureDetector` etc. are used for the anomaly detection, which runs periodically to check if the cluster have their corresponding anomalies or not (This periodic check can be easily configured through the `anomaly.detection.interval.ms` configuration). | ||
Every detector class works in a different way to detect their corresponding anomalies. A `KafkaBrokerFailureDetector` will require a deep usage of requesting metadata to the Kafka cluster (i.e. broker failure) while if we talk about `DiskFailureDetector` it will be more inclined towards using the admin client API. In the same way the `MetricAnomalyDetector` will be making use of metrics and other detector like `GoalViolationDetector` and `TopicAnomalyDetector` will have their own way to detect anomalies. | ||
The detected anomalies can be of various types: | ||
* Goal Violation - This happens if certain optimization goals are violated (e.g. DiskUsageDistributionGoal etc.). These goals can be configured independently (through the `self.healing.goals` config) to those used for manual rebalancing.. |
So you mean the goals in `KafkaRebalance` and the goals in `self.healing.goals` are independent of each other?
If you have specified hard goals in the KR, then the `self.healing.goals` should be a superset of the hard goals.
> if you have specified hard goals in the KR, then the self.healing.goals should be a superset of the hard goals.

If by KR you mean the `KafkaRebalance` custom resource, I don't think what you are saying is true. There is no relationship between a specific `KafkaRebalance` and the self-healing configuration.
Maybe you are referring to the general CC configuration? So the default hard goals in CC have to be a superset of `self.healing.goals` (and not the other way around)?
The anomaly detector manager calls the notifier to get an action regarding, whether the anomaly should be fixed, delayed or ignored. | ||
If the action is fix, then the anomaly detector manager calls the classes that are required to resolve the anomaly. | ||
|
||
Anomaly detection also has several other [configurations](https://github.com/linkedin/cruise-control/wiki/Configurations#anomalydetector-configurations), such as the detection interval and the anomaly notifier class, which can effect the performance of the Cruise Control server and the latency of the anomaly detection. |
Anomaly detection also has several other [configurations](https://github.com/linkedin/cruise-control/wiki/Configurations#anomalydetector-configurations), such as the detection interval and the anomaly notifier class, which can effect the performance of the Cruise Control server and the latency of the anomaly detection. | |
Anomaly detection also has various [configurations](https://github.com/linkedin/cruise-control/wiki/Configurations#anomalydetector-configurations), such as the detection interval and the anomaly notifier class, which can affect the performance of the Cruise Control server and the latency of the anomaly detection. |
Cruise Control also provides [custom notifiers](https://github.com/linkedin/cruise-control/wiki/Configure-notifications) like Slack Notifier, Alerta Notifier etc. for notifying users regarding the anomalies. There are multiple other [self-healing notifier](https://github.com/linkedin/cruise-control/wiki/Configurations#selfhealingnotifier-configurations) related configurations you can use to make notifiers more efficient as per the use case. | ||
|
||
#### Self Healing | ||
If self-healing is enabled in Cruise Control and anomalies are detected, then a fix for these will be attempted automatically. |
A fix will be attempted automatically only if `anomaly.notifier.class` is configured to a non-default notifier, right? So enabling self-healing alone is not enough, as with the default notifier (NoopNotifier), anomalies are detected but ignored?
Yes, you will need to override the notifier too.
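As an illustration of that point, enabling self-healing together with a notifier override in `spec.cruiseControl.config` might look like the sketch below; the fully qualified class name for the proposed `StrimziNotifier` is hypothetical:

```yaml
spec:
  cruiseControl:
    config:
      self.healing.enabled: true
      # With the default NoopNotifier, detected anomalies are ignored, so the
      # notifier class has to be overridden for fixes to actually be triggered.
      # The class name below is a placeholder for the proposed StrimziNotifier.
      anomaly.notifier.class: io.strimzi.kafka.cruisecontrol.notifier.StrimziNotifier
```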
The anomaly status would change based on how the healing progresses. The anomaly can transition to these possible status: | ||
* DETECTED - The anomaly is just detected and no action is yet taken on it | ||
* IGNORED - Self healing is either disabled or the anomaly is unfixable | ||
* FIX_STARTED - The anomaly has started getting fixed |
* FIX_STARTED - The anomaly has started getting fixed | |
* FIX_STARTED - The anomaly fix has started |
* FIX_STARTED - The anomaly has started getting fixed | ||
* FIX_FAILED_TO_START - The proposal generation for the anomaly fix failed | ||
* CHECK_WITH_DELAY - The anomaly fix is delayed | ||
* LOAD_MONITOR_NOT_READY - Monitor which is monitoring the workload of a Kafka cluster is in loading or bootstrap state |
* LOAD_MONITOR_NOT_READY - Monitor which is monitoring the workload of a Kafka cluster is in loading or bootstrap state | |
* LOAD_MONITOR_NOT_READY - Monitor for workload of a Kafka cluster is in loading or bootstrap state |
self.healing.enabled: true | ||
``` | ||
|
||
Where the user wants to enable self-healing for some specific anomalies, they can use configurations like `self.healing.broker.failure.enabled` for broker failure, `self.healing.goal.violation.enabled` for goal violation etc. |
Are there any anomalies enabled by default? If the user does not specify anomalies to detect, will self-healing do nothing?
If the user doesn't specify any specific anomalies (i.e. you use just `self.healing.enabled: true`), then self-healing will detect all the anomalies. In case the user knows that a certain anomaly will keep appearing because they have some issues in their cluster, they can skip the check for that particular anomaly.
> If the user doesn't specify any specific anomalies (i.e. you used just `self.healing.enabled: true`) then self-healing would be detecting all the anomalies.
I wonder if it would be worth having all anomaly detection disabled by default and forcing users to explicitly enable which specific anomalies they want to detect and have fixed automatically. I imagine the alerting of some anomalies might be pretty noisy and the automatic fixes may be quite taxing on the cluster; having all anomalies enabled at once by default might surprise and overload users not familiar with the feature.
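For illustration, selectively enabling detection and healing for only some anomaly types could look like the sketch below, using the per-anomaly keys named in the quoted text; how these interact with `self.healing.enabled` and what the defaults are should follow the Cruise Control documentation:

```yaml
spec:
  cruiseControl:
    config:
      # Heal goal violations automatically...
      self.healing.goal.violation.enabled: true
      # ...but leave broker-failure anomalies for the administrator to handle.
      self.healing.broker.failure.enabled: false
```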
The notifier class returns the `action` that is going to be taken on the notified anomaly. | ||
These actions can be classified into three types: | ||
* FIX - Start the anomaly fix | ||
* CHECK - Delay the anomaly fix |
Does the delay action apply a configured delay before taking the action provided by the notifier?
I left some comments. Can you go through the proposal and split paragraphs with one sentence per line? Thanks!
|
||
With the help of the self-healing feature we can resolve the issues like disk failure, broker failure and other issues automatically. | ||
Cruise Control treats these issues as "anomalies", the detection of which is the responsibility of the anomaly detector mechanism. | ||
Currently, users are [able to set]((https://strimzi.io/docs/operators/latest/full/deploying.html#setting_up_alerts_for_anomaly_detection)) the notifier to one of those included with Cruise Control (<list of included notifiers>). |
This is the first time you mention the term "notifier", but it shows up out of context. You should first explain what it is and why it's useful in general - not only in relation to self-healing but more to alerting.
### Pausing the self-healing feature | ||
|
||
The user should also have the option to pause the self-healing feature, in case they know or plan to do some rolling restart of the cluster or some rebalance they are going to perform which can cause issue with the self-healing itself. | ||
For this, we will be having an annotation called `strimzi.io/self-healing-paused`. |
For this, we will be having an annotation called `strimzi.io/self-healing-paused`. | |
For this, we will be having an annotation called `strimzi.io/self-healing-paused` on the `Kafka` custom resource. |
uid: 71928731-df71-4b9f-ac77-68961ac25a2f | ||
``` | ||
|
||
We will have checks on methods like `onGoalViolation`, `onBrokerFailure`, `onTopicAnomaly` etc. in the `StrimziNotifier` to check if the `pause` annotation is applied on the Kafka custom resource or not. |
There is no `pause` annotation. Use the exact name.
|
||
The anomaly detector manager helps in detecting the anomalies and also handling the detected anomalies. | ||
It acts as a coordinator between the detector classes as well as the classes which will be handling and resolving the anomalies. | ||
Various detector classes like `GoalViolationDetector`, `DiskFailureDetector`, `KafkaBrokerFailureDetector` etc. are used for the anomaly detection, which runs periodically to check if the cluster have their corresponding anomalies or not (This periodic check can be easily configured through the `anomaly.detection.interval.ms` configuration). |
Various detector classes like `GoalViolationDetector`, `DiskFailureDetector`, `KafkaBrokerFailureDetector` etc. are used for the anomaly detection, which runs periodically to check if the cluster have their corresponding anomalies or not (This periodic check can be easily configured through the `anomaly.detection.interval.ms` configuration). | |
Various detector classes like `GoalViolationDetector`, `DiskFailureDetector`, `KafkaBrokerFailureDetector` etc. are used for the anomaly detection, which runs periodically to check if the cluster have their corresponding anomalies or not (The frequency of this check can be easily configured through the `anomaly.detection.interval.ms` configuration). |
Various detector classes like `GoalViolationDetector`, `DiskFailureDetector`, `KafkaBrokerFailureDetector` etc. are used for the anomaly detection, which runs periodically to check if the cluster have their corresponding anomalies or not (This periodic check can be easily configured through the `anomaly.detection.interval.ms` configuration). | ||
Every detector class works in a different way to detect their corresponding anomalies. A `KafkaBrokerFailureDetector` will require a deep usage of requesting metadata to the Kafka cluster (i.e. broker failure) while if we talk about `DiskFailureDetector` it will be more inclined towards using the admin client API. In the same way the `MetricAnomalyDetector` will be making use of metrics and other detector like `GoalViolationDetector` and `TopicAnomalyDetector` will have their own way to detect anomalies. | ||
The detected anomalies can be of various types: | ||
* Goal Violation - This happens if certain optimization goals are violated (e.g. DiskUsageDistributionGoal etc.). These goals can be configured independently (through the `self.healing.goals` config) to those used for manual rebalancing.. |
* Goal Violation - This happens if certain optimization goals are violated (e.g. DiskUsageDistributionGoal etc.). These goals can be configured independently (through the `self.healing.goals` config) to those used for manual rebalancing.. | |
* Goal Violation - This happens if certain optimization goals are violated (e.g. DiskUsageDistributionGoal etc.). These goals can be configured independently (through the `self.healing.goals` config) from those used for manual rebalancing. |
The nuances between the goal configurations of Cruise Control can be quite confusing; to help keep them straight in my head I categorize them in the following way:

- The server-side Cruise Control goal configs in `spec.cruiseControl.config` of the `Kafka` resource
- The client-side Cruise Control goal config in `spec` of the `KafkaRebalance` resource

To avoid confusion, it may be worth adding some text that explains where and how `self.healing.goals` compares/relates to the other server-side Cruise Control goal configs - `goals`, `default.goals`, `hard.goals`, `anomaly.detection.goals`. It wouldn't have to be right here, but somewhere in the proposal where the `self.healing` configuration is raised. When we add the self-healing feature we will need to document how `self.healing.goals` relates to the other goal configurations anyway. We could use the same language used to describe the different goal configs in the Strimzi documentation and provide a link to those docs as well.

I think the language you and Paolo use in this thread referring to other goal configs, like "superset" or "subset", is helpful. Since the goal configs are so complicated, I don't think it is a problem if we reuse text from the Strimzi Cruise Control goal documentation to emphasize the distinction of the `self.healing.goals` config.
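As a purely illustrative sketch of how those server-side goal lists could sit next to each other in `spec.cruiseControl.config` (goal names are abbreviated and the lists are invented; the real values and their subset/superset relationships need to follow the Cruise Control and Strimzi documentation):

```yaml
spec:
  cruiseControl:
    config:
      # Goals used for regular optimization proposals and rebalances
      default.goals: RackAwareGoal,ReplicaCapacityGoal,DiskCapacityGoal,DiskUsageDistributionGoal
      # Goals that no proposal is ever allowed to violate
      hard.goals: RackAwareGoal,ReplicaCapacityGoal,DiskCapacityGoal
      # Goals the anomaly detector checks for violations
      anomaly.detection.goals: RackAwareGoal,ReplicaCapacityGoal,DiskCapacityGoal
      # Goals used when self-healing generates its fix proposals
      self.healing.goals: RackAwareGoal,ReplicaCapacityGoal,DiskCapacityGoal,DiskUsageDistributionGoal
```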
* Metric anomaly - This failure happens if metrics collected by Cruise Control have some anomaly in their value (e.g. a sudden rise in the log flush time metrics). | ||
|
||
The detected anomalies are inserted into a priority queue and the anomaly with the highest property is picked first to resolve. | ||
The anomaly detector manager calls the notifier to get an action regarding, whether the anomaly should be fixed, delayed or ignored. |
The anomaly detector manager calls the notifier to get an action regarding, whether the anomaly should be fixed, delayed or ignored. | |
The anomaly detector manager calls the notifier to get an action regarding whether the anomaly should be fixed, delayed, or ignored. |
|
||
The detected anomalies are inserted into a priority queue and the anomaly with the highest property is picked first to resolve. | ||
The anomaly detector manager calls the notifier to get an action regarding, whether the anomaly should be fixed, delayed or ignored. | ||
If the action is fix, then the anomaly detector manager calls the classes that are required to resolve the anomaly. |
If the action is fix, then the anomaly detector manager calls the classes that are required to resolve the anomaly. | |
If the action is fixed, then the anomaly detector manager calls the classes that are required to resolve the anomaly. |
|
||
### Enabling self-healing in the Kafka custom resource | ||
|
||
When the feature gate is enabled, the operator will set the `self.healing.enabled` Cruise Control configuration to `true` and allow the users to set other appropriate `self-healing` related [configurations](https://github.com/linkedin/cruise-control/wiki/Configurations#selfhealingnotifier-configurations). |
When the feature gate is enabled, the operator will set the `self.healing.enabled` Cruise Control configuration to `true` and allow the users to set other appropriate `self-healing` related [configurations](https://github.com/linkedin/cruise-control/wiki/Configurations#selfhealingnotifier-configurations). | |
When the `UseSelfHealing` feature gate is enabled, users will be able to activate self-healing by configuring the [`self.healing` Cruise Control configurations](https://github.com/linkedin/cruise-control/wiki/Configurations#selfhealingnotifier-configurations) in the `spec.cruiseControl.config` of their `Kafka` resource. | |
When the `UseSelfHealing` feature gate is disabled, users will not be able to activate self healing and all `self.healing` Cruise Control configurations will be ignored. |
|
||
Cruise Control provides `AnomalyNotifier` interface which has multiple abstract methods on what to do if certain anomalies are detected. | ||
`SelfHealingNotifier` is the base class that contains the logic of self-healing and override the methods in by implementing the `AnomalyNotifier` interface. | ||
Some of those methods are:`onGoalViolation()`, `onBrokerFailure()`, `onDisk Failure`, `alert()` etc. |
Some of those methods are:`onGoalViolation()`, `onBrokerFailure()`, `onDisk Failure`, `alert()` etc. | |
Some of those methods are:`onGoalViolation()`, `onBrokerFailure()`, `onDiskFailure()`, `alert()` etc. |
`SelfHealingNotifier` is the base class that contains the logic of self-healing and override the methods in by implementing the `AnomalyNotifier` interface. | ||
Some of those methods are:`onGoalViolation()`, `onBrokerFailure()`, `onDisk Failure`, `alert()` etc. | ||
The `SelfHealingNotifier` class can then be further used to implement your own notifier | ||
The other custom notifier offered by Cruise control are also based upon the `SelfHealingNotifier`. |
The other custom notifier offered by Cruise control are also based upon the `SelfHealingNotifier`. | |
The other custom notifiers offered by Cruise control are also based upon the `SelfHealingNotifier`. |
|
||
### Strimzi Operations while self-healing is running/fixing an anomaly | ||
|
||
Self-healing is a very robust and resilient feature due to its ability to tackle issues which can appear while the healing of cluster is happening. |
Self-healing is a very robust and resilient feature due to its ability to tackle issues which can appear while the healing of cluster is happening. | |
Self-healing is a very robust and resilient feature due to its ability to tackle issues while healing a cluster. |
Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>
|
||
#### What happens when self-healing is running/fixing an anomaly and brokers are rolled | ||
|
||
If a broker fails, self-healing will trigger and Cruise Control will try to move the partition replicas from the failed broker to other healthy brokers in the Kafka cluster. | ||
If, during the process of moving partitions from the failed broker, another broker in the Kafka cluster gets deleted or rolled, then Cruise Control would finish the transfer from the failed broker but log that `self-healing finished successful with error`. | ||
If some anomaly is getting healed , Cruise Control will then try to fix/mitigate the anomaly by moving the partition replicas across the brokers in the Kafka cluster. |
When an anomaly is detected, Cruise Control may try to fix/mitigate it by moving the partition replicas across the brokers in the Kafka cluster.
The detected anomalies can be of various types: | ||
* Goal Violation - This happens if certain optimization goals are violated (e.g. DiskUsageDistributionGoal etc.). These goals can be configured independently (through the `self.healing.goals` config) to those used for manual rebalancing.. | ||
* Goal Violation - This happens if certain optimization goals are violated (e.g. DiskUsageDistributionGoal etc.). These goals can be configured independently (through the `self.healing.goals` configuration under the `spec.cruiseControl.config` section) to those used for manual rebalancing(goals configured in the KafkaRebalance custom resource). |
I think this part confuses me slightly, maybe it can be reworded better.
Is the below what you mean?

- Goal Violation - This happens if certain optimization goals such as `DiskUsageDistributionGoal` are violated. The optimization goals can be configured independently from the `KafkaRebalance` custom resource by using `spec.cruiseControl.config.self.healing.goals` in the `Kafka` custom resource.

When you say certain optimization goals, are you saying only some would trigger the Goal Violation anomaly?
Yes, AFAIK it would only be those goals specifically listed in the self healing goals configuration.
I tried to make the sentence less confusing, I hope it's better now.
@@ -76,17 +79,18 @@ statusUpdateDate=2024-12-09T09:49:47Z, | |||
failedDisksByTimeMs={2={/var/lib/kafka/data-1/kafka-log2=1733737786603}}, | |||
status=FIX_STARTED}]} | |||
``` | |||
where the `anomalyID` represents the task id assigned to the anomaly, `failedDisksByTimeMs` represents the disk which failed, `detectionDate` and `statusUpdateDate` represet the date and time when the anomaly was detected and the status of the anomaly was updated. |
I don't see the mentioned `anomalyType` field in the snippet. Is this a full snippet of an anomaly? I guess it's not that important, since this is just an example, but should we explain the value of `failedDisksByTimeMs`? What do the values in `{2={/var/lib/kafka/data-1/kafka-log2=1733737786603}}` mean, and how do they represent the failed disk? If you think this is not important, please feel free to ignore this part.
I have added some explanation around it
* COMPLETENESS_NOT_READY - Completeness not met for some goals. The completeness describes the confidence level of the data in the metric sample aggregator. If this is low then some Goal classes may refuse to produce proposals. | ||
|
||
We can also poll the `state` endpoint, with `substate` parameter as `executor`, to get details on the tasks that are currently being performed by Cruise Control. | ||
We can again poll the [`state` endpoint](https://github.com/linkedin/cruise-control/wiki/REST-APIs#query-the-state-of-cruise-control), with `substate` parameter as `executor`, to get details on the tasks that are currently being performed by Cruise Control. |
Will we describe how users might poll this endpoint?
By this, do you mean mentioning the complete curl command?
I have added the curl command to poll the endpoint
@@ -177,13 +183,13 @@ The event published would look like this: | |||
type: Normal | |||
``` |
#152 - would this proposal have an impact on the format of the Kubernetes event for consistency?
Not sure, the notifier would reside as a separate repo under Strimzi, but I can adapt the events to look like this in the proposal. Wdyt @ppatierno?
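For discussion, a hypothetical sketch of what such an event could look like as a core `v1` Kubernetes `Event` (all field values are invented for illustration; the final shape would follow whatever #152 and this proposal agree on):

```yaml
apiVersion: v1
kind: Event
metadata:
  name: my-cluster-cruise-control-anomaly
  namespace: kafka
involvedObject:
  apiVersion: kafka.strimzi.io/v1beta2
  kind: Kafka
  name: my-cluster
reason: DiskFailureDetected                                   # hypothetical reason string
message: "Self-healing fix started for anomaly <anomalyId>"   # hypothetical wording
type: Normal
```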
Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>
This proposal shows how we plan to implement the self-healing feature in the operator.