Fix restart logic for ScheduleOnly init policy #756
Merged
This change fixes an issue seen when deploying VerticaDB with the ScheduleOnly initialization policy. Previously, when a restart was triggered, the reconciler would not restart individual nodes if the cluster lacked quorum according to pod state. The intention was to let the spread process bring down all remaining nodes and then restart the entire cluster. That approach doesn't work with the ScheduleOnly policy because the operator doesn't have a complete view of the cluster; it only knows about the pods running in k8s. Other nodes outside of VerticaDB may be contributing to the cluster's quorum. To address this, I'm refining the logic to perform the quorum check only when the initialization policy isn't ScheduleOnly.
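For reviewers who want the gist without reading the diff, the decision can be sketched roughly as follows. This is an illustration only, not the operator's actual code: the `InitPolicy` type, the `PodState` struct, and the `hasQuorum` and `shouldRestartDownNodes` helpers are hypothetical names, and quorum is assumed to mean that more than half of the known pods are up.

```go
package main

import "fmt"

// InitPolicy is a hypothetical stand-in for the VerticaDB initialization policy.
type InitPolicy string

const (
	InitPolicyCreate       InitPolicy = "Create"
	InitPolicyScheduleOnly InitPolicy = "ScheduleOnly"
)

// PodState is a hypothetical, simplified view of what the operator knows about a pod.
type PodState struct {
	Name string
	Up   bool
}

// hasQuorum reports whether more than half of the known pods are up.
// Assumption: a simple majority rule over the pods visible in k8s.
func hasQuorum(pods []PodState) bool {
	up := 0
	for _, p := range pods {
		if p.Up {
			up++
		}
	}
	return 2*up > len(pods)
}

// shouldRestartDownNodes decides whether individual down nodes may be restarted.
// With ScheduleOnly the pod list is only a partial view of the cluster, so the
// pod-based quorum check is skipped. For other policies, individual restarts
// are held back when quorum is lost so the whole cluster can be restarted.
func shouldRestartDownNodes(policy InitPolicy, pods []PodState) bool {
	if policy == InitPolicyScheduleOnly {
		return true
	}
	return hasQuorum(pods)
}

func main() {
	// One of three known pods is up, so the pod-based view has no quorum.
	pods := []PodState{
		{Name: "pod-0", Up: true},
		{Name: "pod-1", Up: false},
		{Name: "pod-2", Up: false},
	}
	fmt.Println("Create:", shouldRestartDownNodes(InitPolicyCreate, pods))
	fmt.Println("ScheduleOnly:", shouldRestartDownNodes(InitPolicyScheduleOnly, pods))
}
```

In this sketch, the same pod states block individual restarts under a non-ScheduleOnly policy but allow them under ScheduleOnly, since nodes outside the operator's view may still be holding quorum.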
This is a regression introduced in version 2.0.0 of the operator. There is an e2e test designed to verify this behavior, but it was mistakenly disabled, which is how the regression slipped in. Another pull request (#751) will be submitted to re-enable this test.