
RabbitMQ doesn't have a good clustering story. The clustering was added after the fact, and it shows. I've written about it on HN several times before, e.g. [1]. Also see Aphyr's Jepsen test of RabbitMQ [2], which demonstrates the problem a bit more rigorously.

With HA mode enabled, it will behave decently during a network partition (which can be caused by non-network-related things: high CPU, for example), but there is no way to safely recover without losing messages. (Note: The frame size issue I mention in that comment has been fixed in one of the latest versions.)

We have also encountered multiple bugs where RabbitMQ gets into a bad state that requires manual recovery. For example, it will suddenly lose all the queue bindings. Or queues will go missing. In several cases the RabbitMQ authors have given me a code snippet to run in the Erlang REPL to fix some internal state table; however, even if you know Erlang, you have to know the deep internals of RabbitMQ in order to come up with such a snippet. There have been a couple of completely unrecoverable incidents where I've simply ended up taking down RabbitMQ, deleting its Mnesia database, and starting up a new cluster. Fortunately, we use RabbitMQ in a way that allows us to do that.

The bugs have become less frequent over the years, but they're not altogether gone. It's a shame, since RabbitMQ should have been a model showcase for Erlang's tremendous support for distribution and fault tolerance. You're lucky if you've not had any issues with it; personally, I would move away from RabbitMQ in a heartbeat if we had the resources to rewrite a whole bunch of apps. We've started using NATS for some things where persistence isn't needed, and might look at Kafka for some other applications.

[1] https://news.ycombinator.com/item?id=9448258

[2] https://aphyr.com/posts/315-jepsen-rabbitmq