Handle transient queue deletion in Khepri minority (backport #11979) (backport #11990) #11991

mergify · 2024-08-13T19:58:33Z

Transient queue deletion previously caused a crash if Khepri was enabled and a node with a transient queue went down while its cluster was in a minority. We need to handle the `{error,timeout}` return possible from `rabbit_db_queue:delete_transient/1`. In the `rabbit_amqqueue:on_node_down/1` callback we log a warning when we see this return.

We then try this deletion again during that node's `rabbit_khepri:init/0`, which is called from a boot step after `rabbit_khepri:setup/0`. At that point we can return an error and halt the node's boot if the command times out. The cluster is very likely to be in a majority at that point since `rabbit_khepri:setup/0` waits for a leader to be elected (requiring a majority).
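As a rough illustration of the two code paths described above, here is a minimal, self-contained sketch. It is not the code from this PR: the module name, the `DeleteFun` parameter, and the `{ok, Deletions}` success shape are assumptions made so the example runs without RabbitMQ. Only the `{error, timeout}` return comes from the description.

```erlang
-module(minority_delete_sketch).
-export([on_node_down/2, init/1]).

%% DeleteFun is a hypothetical zero-arity stand-in for the call to
%% rabbit_db_queue:delete_transient/1, so the sketch runs without RabbitMQ.
%% Assumed return shapes: {ok, Deletions} | {error, timeout}.

%% Node-down path: a timeout is logged as a warning and tolerated,
%% instead of crashing on an unexpected return value.
on_node_down(Node, DeleteFun) ->
    case DeleteFun() of
        {error, timeout} ->
            logger:warning("Timed out deleting transient queues of ~tp",
                           [Node]),
            ok;
        {ok, _Deletions} ->
            ok
    end.

%% Boot path: a timeout is returned to the caller, which can then
%% halt the node's boot.
init(DeleteFun) ->
    case DeleteFun() of
        {error, timeout} = Error ->
            Error;
        {ok, _Deletions} ->
            ok
    end.
```

The same return value is handled differently in each place: on node down the cluster may still be in a minority, so the deletion is deferred, while at boot a leader has already been elected, so a timeout is treated as a real failure.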

This fixes a crash report found in the `cluster_minority_SUITE`'s `end_per_group`.

This is an automatic backport of pull request #11979 done by [Mergify](https://mergify.com).

This is an automatic backport of pull request #11990 done by [Mergify](https://mergify.com).

(cherry picked from commit 053c871) (cherry picked from commit 301e82e)

(cherry picked from commit d0da0b5) (cherry picked from commit 950f555)

The prior code skirted transactions because the filter function might cause Khepri to call itself. We want to use the same idea as the old code (get all queues, filter them, then delete them) but perform the deletion in a transaction and fail the transaction if any queues changed since we read them.

This fixes a bug, namely that the call to `delete_in_khepri/2` could return an error tuple that would be improperly recognized as `Deletions`, and it should also make deleting transient queues atomic and fast. Previously, each call to `delete_in_khepri/2` had to wait for Ra to replicate it, because each deletion was an individual command sent from one process. Performing all deletions at once means we only need to wait for one command to be replicated across the cluster.

We also now bubble up any deletion errors rather than storing them as deletions. This fixes a crash that occurs on node down when Khepri is in a minority.

(cherry picked from commit 0dd26f0) (cherry picked from commit 0f90906)

Conflicts: `deps/rabbit_common/src/rabbit_misc.erl`
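The read, filter, then delete-in-one-transaction idea from the commit message above can be modelled with a plain map. This is a toy sketch, not Khepri's API and not the code in this PR: the module name, the `{Version, Queue}` store layout, and the error shape are all assumptions chosen for illustration.

```erlang
-module(tx_delete_sketch).
-export([delete_transient/2]).

%% Store    :: #{Name => {Version, Queue}}, a toy stand-in for Khepri.
%% Snapshot :: [{Name, Version}], what the caller read and filtered
%%             earlier, outside the transaction.
%%
%% Returns {ok, NewStore} only if every snapshot entry is still at the
%% version that was read. Otherwise nothing is deleted and
%% {error, {changed, Name}} names the first entry that changed.
delete_transient(Snapshot, Store) ->
    case [N || {N, V} <- Snapshot, not unchanged(N, V, Store)] of
        [] ->
            Names = [N || {N, _} <- Snapshot],
            {ok, maps:without(Names, Store)};
        [Name | _] ->
            {error, {changed, Name}}
    end.

%% True when Name is still present at exactly the version that was read.
unchanged(Name, Version, Store) ->
    case maps:find(Name, Store) of
        {ok, {Version, _Queue}} -> true;
        _ -> false
    end.
```

Because the check and all deletions happen in one step, the result is all-or-nothing, and the error is returned to the caller as an error rather than being mistaken for a list of deletions.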

Transient queue deletion previously caused a crash if Khepri was enabled and a node with a transient queue went down while its cluster was in a minority. We need to handle the `{error,timeout}` return possible from `rabbit_db_queue:delete_transient/1`. In the `rabbit_amqqueue:on_node_down/1` callback we log a warning when we see this return.

We then try this deletion again during that node's `rabbit_khepri:init/0`, which is called from a boot step after `rabbit_khepri:setup/0`. At that point we can return an error and halt the node's boot if the command times out. The cluster is very likely to be in a majority at that point since `rabbit_khepri:setup/0` waits for a leader to be elected (requiring a majority).

This fixes a crash report found in the `cluster_minority_SUITE`'s `end_per_group`.

(cherry picked from commit 3f734ef) (cherry picked from commit 006f517)

the-mikedavis added 2 commits August 13, 2024 19:58

rabbit_db: Lower log level of Khepri members log line

dc5fabc

(cherry picked from commit 053c871) (cherry picked from commit 301e82e)

Move Khepri DB init to rabbit_khepri:init/0

666d2bc

(cherry picked from commit d0da0b5) (cherry picked from commit 950f555)

mergify bot added the conflicts label Aug 13, 2024

the-mikedavis added 2 commits August 13, 2024 16:18

the-mikedavis force-pushed the mergify/bp/v3.13.x/pr-11990 branch from ee66e8d to 4632a5a Compare August 13, 2024 20:19

michaelklishin added this to the 3.13.7 milestone Aug 14, 2024

michaelklishin merged commit 79f6507 into v3.13.x Aug 14, 2024
188 checks passed

michaelklishin deleted the mergify/bp/v3.13.x/pr-11990 branch August 14, 2024 01:33
