Handle transient queue deletion in Khepri minority #11979

the-mikedavis · 2024-08-12T20:16:14Z

Transient queue deletion previously caused a crash if Khepri was enabled and a node with a transient queue went down while its cluster was in a minority. We need to handle the `{error,timeout}` return that `rabbit_db_queue:delete_transient/1` can produce. In the `rabbit_amqqueue:on_node_down/1` callback we now log a warning when we see this return.

We then retry the deletion during that node's `rabbit_khepri:init/0`, which is called from a boot step after `rabbit_khepri:setup/0`. At that point we can return an error and halt the node's boot if the command times out. The cluster is very likely to be in a majority by then, since `rabbit_khepri:setup/0` waits for a leader to be elected (which requires a majority).

This fixes a crash report found in the `cluster_minority_SUITE`'s `end_per_group`.
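The two code paths described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the module name, the `DeleteFun` argument (a stand-in for `rabbit_db_queue:delete_transient/1`), and the return values of both functions are assumptions made for the example.

```erlang
-module(transient_cleanup_sketch).
-export([on_node_down/2, init/2]).

%% DeleteFun is a hypothetical stand-in for
%% rabbit_db_queue:delete_transient/1. It is assumed to return either
%% a list of deletions or {error, timeout}.

%% Node-down path: a timeout is logged and tolerated instead of being
%% treated as a successful result.
on_node_down(Node, DeleteFun) ->
    case DeleteFun(Node) of
        {error, timeout} ->
            logger:warning("Timed out deleting transient queues of ~tp; "
                           "deletion is retried when the node boots",
                           [Node]),
            ok;
        Deletions when is_list(Deletions) ->
            {ok, Deletions}
    end.

%% Boot path: the same deletion is retried. Here a timeout is returned
%% to the caller so that the boot step can fail.
init(Node, DeleteFun) ->
    case DeleteFun(Node) of
        {error, timeout} = Error ->
            Error;
        Deletions when is_list(Deletions) ->
            ok
    end.
```

The point of the split is that a timeout is survivable while the cluster is in a minority, but is treated as a hard failure once a leader is expected to exist.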

deps/rabbit/src/rabbit_db.erl

deps/rabbit/src/rabbit_db_queue.erl

The prior code skirted transactions because the filter function might cause Khepri to call itself. We want to keep the same approach as the old code (get all queues, filter them, then delete them) but perform the deletion in a transaction and fail the transaction if any queues changed since we read them.

This fixes a bug where the call to `delete_in_khepri/2` could return an error tuple that would be improperly recognized as `Deletions`. It should also make deleting transient queues atomic and fast. Previously, each call to `delete_in_khepri/2` needed to wait on Ra to replicate, because each deletion was an individual command sent from one process. Performing all deletions at once means we only need to wait for one command to be replicated across the cluster.

We also now bubble up any deletion errors rather than storing them as deletions. This fixes a crash that occurs on node down when Khepri is in a minority.
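The read, filter, then delete-in-one-transaction idea can be sketched with a plain map standing in for the Khepri store. Everything here is hypothetical and for illustration: the module and function names, the `{Version, Queue}` value shape, and the `{error, changed}` return are assumptions, and no Khepri API is used.

```erlang
-module(transient_delete_sketch).
-export([select/2, commit/2]).

%% Store is a map of QueueName => {Version, Queue}. It is a toy
%% stand-in for the Khepri store, not its real data model.

%% Step 1: read all queues and filter them outside of any transaction,
%% remembering the version each matching queue had when it was read.
select(Store, FilterFun) ->
    [{Name, Version}
     || {Name, {Version, Queue}} <- maps:to_list(Store),
        FilterFun(Queue)].

%% Step 2: delete every candidate in a single step, but only if each
%% one still has the version seen in step 1. Otherwise fail the whole
%% operation and delete nothing.
commit(Store, Candidates) ->
    Unchanged = lists:all(
                  fun({Name, Version}) ->
                          case maps:find(Name, Store) of
                              {ok, {Version, _Queue}} -> true;
                              _ -> false
                          end
                  end, Candidates),
    case Unchanged of
        true ->
            Names = [Name || {Name, _Version} <- Candidates],
            {ok, maps:without(Names, Store), Names};
        false ->
            {error, changed}
    end.
```

Because `commit/2` either removes all candidates or returns an error tuple, a caller cannot mistake a failure for a list of deletions, and the batch costs one replicated command instead of one per queue.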

the-mikedavis added 2 commits August 12, 2024 14:10

rabbit_db: Lower log level of Khepri members log line

053c871

Move Khepri DB init to rabbit_khepri:init/0

d0da0b5

the-mikedavis added backport-v3.13.x backport-v4.0.x labels Aug 12, 2024

the-mikedavis marked this pull request as ready for review August 13, 2024 04:26

dumbbell requested changes Aug 13, 2024

the-mikedavis added 2 commits August 13, 2024 11:40

the-mikedavis force-pushed the md/khepri/transient-queue-deletion-minority branch from 78fa268 to 3f734ef Compare August 13, 2024 15:40

dumbbell approved these changes Aug 13, 2024

the-mikedavis merged commit 267d7b8 into main Aug 13, 2024
238 checks passed

the-mikedavis deleted the md/khepri/transient-queue-deletion-minority branch August 13, 2024 18:51

This was referenced Aug 13, 2024

Handle transient queue deletion in Khepri minority (backport #11979) #11990

Merged

Handle transient queue deletion in Khepri minority (backport #11979) (backport #11990) #11991

Merged

michaelklishin added a commit that referenced this pull request Aug 14, 2024

Merge pull request #11991 from rabbitmq/mergify/bp/v3.13.x/pr-11990

79f6507

Handle transient queue deletion in Khepri minority (backport #11979) (backport #11990)
