Fix restart logic for ScheduleOnly init policy #756

spilchen · 2024-04-03T12:18:55Z

This change fixes a restart issue seen when deploying VerticaDB with the ScheduleOnly initialization policy. Previously, when a restart was triggered, the reconciler would not restart individual nodes if the pod state showed that the cluster lacked quorum. The intention was to let the spread process bring down all remaining nodes before restarting the entire cluster. However, this approach doesn't work with the ScheduleOnly policy, because the operator doesn't have a complete view of the cluster; it only knows about the pods running in k8s. Other nodes outside of VerticaDB could be contributing to the cluster's quorum. To address this, I'm changing the logic to perform the quorum check only when the initialization policy isn't ScheduleOnly.
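The gating described above can be sketched roughly as follows. This is a minimal illustration, not the operator's actual code: the `InitPolicy` type, its constants, and the `hasQuorum` and `shouldDeferNodeRestart` helpers are hypothetical names, and quorum is simplified to "more than half of the known pods are up".

```go
package main

// InitPolicy is a hypothetical stand-in for the VerticaDB initialization policy.
type InitPolicy string

const (
	// Illustrative values only; the operator defines its own constants.
	InitPolicyCreate       InitPolicy = "Create"
	InitPolicyScheduleOnly InitPolicy = "ScheduleOnly"
)

// hasQuorum reports whether more than half of the known pods are up.
// Assumption: this simplifies the real quorum rules.
func hasQuorum(upPods, totalPods int) bool {
	return totalPods > 0 && 2*upPods > totalPods
}

// shouldDeferNodeRestart reports whether individual node restarts should be
// held back so that the whole cluster can be restarted instead.
func shouldDeferNodeRestart(policy InitPolicy, upPods, totalPods int) bool {
	// With ScheduleOnly, the pod state is only a partial view of the cluster:
	// nodes outside of k8s may still provide quorum, so skip the check.
	if policy == InitPolicyScheduleOnly {
		return false
	}
	// For other policies, defer individual restarts when the pods lack quorum.
	return !hasQuorum(upPods, totalPods)
}
```

With one of three pods up, `shouldDeferNodeRestart(InitPolicyCreate, 1, 3)` returns `true`, while `shouldDeferNodeRestart(InitPolicyScheduleOnly, 1, 3)` returns `false`, so the down nodes are restarted individually.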

This is a regression that was introduced in version 2.0.0 of the operator. There is an e2e test specifically designed to verify this behavior, but it was mistakenly disabled, which is what ended up causing this regression in the first place. Another pull request (#751) will be submitted to re-enable this test.

spilchen requested review from jizhuoyu and roypaulin April 3, 2024 12:18

spilchen self-assigned this Apr 3, 2024

spilchen mentioned this pull request Apr 3, 2024

Refactor e2e legs GitHub workflow #751

Merged

roypaulin approved these changes Apr 3, 2024

spilchen merged commit d0161c5 into main Apr 3, 2024
30 checks passed

spilchen deleted the spilchen/fix-schedule-only branch April 3, 2024 17:17
