High Availability RabbitMQ Cluster #1

Open
steefaan opened this issue Oct 20, 2015 · 1 comment

@steefaan

Maybe some of you have followed the discussion over the last few days regarding the high availability of RabbitMQ. @lorenzo suggested some useful options for setting up a high-availability RabbitMQ cluster.

  1. Consul with RabbitMQ cluster - A nice solution, but Consul has a time window between health checks. If a node goes down during this window, clients will still be routed to the unreachable node.
  2. HAProxy with RabbitMQ cluster - Also a nice solution and a very good load balancer, but you need to set up the load balancer on every one of your webserver nodes. You still have issues if the load balancer itself crashes, or if a request hits a dead RabbitMQ node during the health check window, even if you set it to 1000ms.

After a lot of research into the possible options, I came to the conclusion that a simple pool of RabbitMQ nodes inside the plugin config would be the best solution for high availability. This way you have a pool of all your RabbitMQ nodes and you loop through them until you find a reachable one (rough sketch below, after the list). This would require some modifications to the process-mq plugin. I would make these modifications in a PR, but first I need to know whether it has a chance of getting merged.

Advantages of this approach:

  • You're independent of the health check periods of other applications (HAProxy/Consul/...)
  • You're independent of other applications in general; it's plain PHP/CakePHP, which saves configuration time and means fewer dependencies
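
A minimal sketch of what the node pool could look like, assuming the plugin talks to RabbitMQ through php-amqplib; the host list and the connectToPool() helper are made up for illustration and are not part of the plugin:

```php
<?php
use PhpAmqpLib\Connection\AMQPStreamConnection;

// Hypothetical pool of RabbitMQ nodes, as it could appear in the plugin config.
$pool = [
    ['host' => 'rabbit1.example.com', 'port' => 5672],
    ['host' => 'rabbit2.example.com', 'port' => 5672],
    ['host' => 'rabbit3.example.com', 'port' => 5672],
];

/**
 * Try each node in the pool and return the first connection that succeeds.
 */
function connectToPool(array $pool, $user, $password, $vhost = '/')
{
    foreach ($pool as $node) {
        try {
            return new AMQPStreamConnection(
                $node['host'],
                $node['port'],
                $user,
                $password,
                $vhost
            );
        } catch (\Exception $e) {
            // Node unreachable, fall through to the next one in the pool.
            continue;
        }
    }

    throw new \RuntimeException('No reachable RabbitMQ node in the pool.');
}

$connection = connectToPool($pool, 'guest', 'guest');
```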
@jippi
Member

jippi commented Oct 20, 2015

We use the Consul approach at Bownty, and it's quite durable.

What we do is expose a service (e.g. queue.service.bownty) that we connect to; Consul will then route the request to any service instance that is online (and passing a basic health check).

We run our health checks every 5s; you could run them once a second just fine. When I plan downtime, I simply make the Consul agent leave, the DNS/service change is propagated right away, and then a trigger restarts the relevant jobs in supervisor.
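
For reference, a Consul service definition along those lines could look like the sketch below; the service name, port and check command are placeholders for illustration, not our actual config:

```json
{
  "service": {
    "name": "queue",
    "tags": ["rabbitmq"],
    "port": 5672,
    "check": {
      "script": "nc -z localhost 5672",
      "interval": "5s"
    }
  }
}
```

With something like that registered on every RabbitMQ node, the DNS name only resolves to instances whose check is currently passing.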

If something goes down, supervisor will simply keep restarting the workers until they resolve a working server (within at most the 5s it takes for the health check to re-run), so no jobs are lost on the consuming end.

For the frontend, we currently just allow it to fail, as a 2-3 second service disruption once every 3-5 months is trivial to us compared to the added complexity of doing something more elaborate. We could reduce the check interval to 1s to more or less avoid any issues :)

We've never seen RabbitMQ crash (thanks, Erlang), but the issues Rabbit does cause are around queue throttling due to resource constraints (backoff for too-busy queues). That is super hard to handle reliably and won't be resolved by either of your solutions either: you can connect, but you can't publish to the queue, no matter which box you connect to.
