High Availability RabbitMQ Cluster #1

Open
steefaan opened this issue Oct 20, 2015 · 1 comment

@steefaan

Maybe some of you have followed the discussion over the last few days regarding the high availability of RabbitMQ. @lorenzo suggested some useful options for setting up a high-availability RabbitMQ cluster.

  1. Consul with RabbitMQ cluster - A nice solution, but Consul has a time window between health checks. If a node goes down during this window, clients will still be routed to the unreachable node.
  2. HAProxy with RabbitMQ cluster - Also a nice solution and a very good load balancer, but you need to set up the load balancer on every one of your webserver nodes. You still have issues if the load balancer itself crashes, or if a request hits a dead RabbitMQ node during the health check window, even if you set it to 1000ms.

After a lot of research into the possible options, I came to the conclusion that a simple pool of RabbitMQ nodes inside the plugin config would be the best solution for high availability. This way you have a pool of all your RabbitMQ nodes and you loop through them until you find a reachable one (rough sketch below, after the list). This would require some modifications to the process-mq plugin. I would make these modifications in a PR, but first I need to know whether it has a chance of getting merged.

Advantages of this approach:

  • You're independent of the health check periods of other applications (HAProxy/Consul/...)
  • You're independent of other applications in general; it's plain PHP/CakePHP, which saves configuration time and means fewer dependencies
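
A minimal sketch of what the node pool could look like, assuming the plugin talks to RabbitMQ through php-amqplib; the host list and the connectToPool() helper are made up for illustration and are not part of the plugin:

```php
<?php
use PhpAmqpLib\Connection\AMQPStreamConnection;

// Hypothetical pool of RabbitMQ nodes, as it could appear in the plugin config.
$pool = [
    ['host' => 'rabbit1.example.com', 'port' => 5672],
    ['host' => 'rabbit2.example.com', 'port' => 5672],
    ['host' => 'rabbit3.example.com', 'port' => 5672],
];

/**
 * Try each node in the pool and return the first connection that succeeds.
 */
function connectToPool(array $pool, $user, $password, $vhost = '/')
{
    foreach ($pool as $node) {
        try {
            return new AMQPStreamConnection(
                $node['host'],
                $node['port'],
                $user,
                $password,
                $vhost
            );
        } catch (\Exception $e) {
            // Node unreachable, fall through to the next one in the pool.
            continue;
        }
    }

    throw new \RuntimeException('No reachable RabbitMQ node in the pool.');
}

$connection = connectToPool($pool, 'guest', 'guest');
```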
@jippi
Member

jippi commented Oct 20, 2015

We use the Consul approach at Bownty, and it's quite durable.

What we do is expose a service (e.g. queue.service.bownty) that we connect to; Consul will then route the request to any service instance that is online (and passing a basic health check).

We run our health checks every 5s; you could run them once a second just fine. When I plan downtime, I simply make the Consul agent leave, the DNS/service change is propagated right away, and then a trigger restarts the relevant jobs in supervisor.
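
For reference, a Consul service definition along those lines could look like the sketch below; the service name, port and check command are placeholders for illustration, not our actual config:

```json
{
  "service": {
    "name": "queue",
    "tags": ["rabbitmq"],
    "port": 5672,
    "check": {
      "script": "nc -z localhost 5672",
      "interval": "5s"
    }
  }
}
```

With something like that registered on every RabbitMQ node, the DNS name only resolves to instances whose check is currently passing.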

If something goes down, supervisor will simply keep restarting the workers until they resolve a working server (within at most the 5s it takes for the health check to re-run), so no jobs are lost on the consuming end.

For the frontend, we currently just allow it to fail, as a 2-3 second service disruption once every 3-5 months is trivial to us compared to the added complexity of doing something more elaborate. We could reduce the check interval to 1s to more or less avoid any issues :)

We've never seen RabbitMQ crash (thanks, Erlang), but the issues Rabbit does cause are around queue throttling due to resource constraints (backoff for too-busy queues). That is super hard to handle reliably and won't be resolved by either of your solutions either: you can connect, but you can't publish to the queue, no matter which box you connect to.
