|
|
|
|
|
by ngrilly
3569 days ago
|
|
> RabbitMQ is of course flaky for other reasons unrelated to Kubernetes -- I wouldn't run it if I didn't have existing apps that relies on it. I'm surprised because I know teams which are very satisfied with running RabbitMQ at scale. Could you elaborate? |
|
With HA mode enabled, it will behave decently during a network partition (which can be caused by non-network-related things: high CPU, for example), but there is no way to safely recover without losing messages. (Note: The frame size issue I mention in that comment has been fixed in one of the latest versions.)
We have also encountered multiple bugs where RabbitMQ will get into a bad state that requires manual recovery. For example, it will suddenly lose all the queue bindings. Or queues will go missing. In several cases the RabbitMQ authors have given me a code snippet to run with the Erlang RELP to fix some internal state table; however, even if you know Erlang, you have to know the deep internals of RabbitMQ in order to think up such a code snippet. There have been a couple of completely unrecoverable incidents where I've simply ended up taking down RabbitMQ, deleted its Mnesia database, and started up a new cluster again. Fortunately, we use RabbitMQ in a way that allows us to do that.
The bugs have been getting fewer over the years, but they're not altogether gone. It's a shame, since RabbitMQ should have a model showcase for Erlang's tremendous support for distribution and fault-tolerance. You're lucky if you've not had any issues with it; personally, I would move away from RabbitMQ in a heartbeat, if we had the resources to rewrite a whole bunch of apps. We've started using NATS for some things where persistence isn't needed, and might look at Kafka for some other applications.
[1] https://news.ycombinator.com/item?id=9448258
[2] https://aphyr.com/posts/315-jepsen-rabbitmq