Fix restart logic for ScheduleOnly init policy #756

Merged
spilchen merged 1 commit into main from spilchen/fix-schedule-only
Apr 3, 2024

Conversation

spilchen
Collaborator

@spilchen spilchen commented Apr 3, 2024

This change fixes a restart problem in VerticaDB deployments that use the ScheduleOnly initialization policy. Previously, when a restart was triggered, the reconciler refused to restart individual nodes if the pod state indicated that the cluster had lost quorum. The intention was to let the spread process bring down all remaining nodes and then restart the entire cluster. This doesn't work with the ScheduleOnly policy because the operator doesn't have a complete view of the cluster; it only knows about the pods running in k8s. Nodes outside of VerticaDB may also be contributing to the cluster's quorum. To address this, I'm refining the logic so that the quorum check is performed only when the initialization policy isn't ScheduleOnly.
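
The decision described above can be sketched roughly as follows. This is a minimal illustration in Python, not the operator's actual code: `ClusterView`, `has_quorum`, and `can_restart_individual_nodes` are hypothetical names, the "more than half of the known pods are up" quorum rule is an assumption made for the example, and `"Create"` stands in for any non-ScheduleOnly policy.

```python
from dataclasses import dataclass

SCHEDULE_ONLY = "ScheduleOnly"


@dataclass
class ClusterView:
    """What the operator can see from pod state alone (hypothetical type)."""
    init_policy: str  # initialization policy, e.g. "Create" or "ScheduleOnly"
    total_pods: int   # pods the operator manages in k8s
    up_pods: int      # managed pods whose database process is up


def has_quorum(up: int, total: int) -> bool:
    # Assumed rule for this sketch: quorum means strictly more than half are up.
    return 2 * up > total


def can_restart_individual_nodes(view: ClusterView) -> bool:
    """Decide whether down nodes may be restarted one at a time."""
    if view.init_policy == SCHEDULE_ONLY:
        # The pod state is only a partial view of the cluster, so a quorum
        # check based on it is unreliable. Skip the check entirely.
        return True
    # For other policies the pods are the whole cluster. Without quorum,
    # hold off so the remaining nodes go down and the cluster restarts whole.
    return has_quorum(view.up_pods, view.total_pods)
```

With one of three pods up, `can_restart_individual_nodes(ClusterView("Create", 3, 1))` holds off on per-node restarts, while the same pod state under `"ScheduleOnly"` allows them, since nodes outside the operator's view may still be keeping quorum.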

This is a regression introduced in version 2.0.0 of the operator. There is an e2e test designed to verify this behavior, but it was mistakenly disabled, which is how the regression slipped through. Another pull request (#751) will be submitted to re-enable the test.

@spilchen spilchen self-assigned this Apr 3, 2024
@spilchen spilchen merged commit d0161c5 into main Apr 3, 2024
30 checks passed
@spilchen spilchen deleted the spilchen/fix-schedule-only branch April 3, 2024 17:17