Hacker News new | ask | show | jobs
by lobster_johnson 3569 days ago
As long as you associate a separate service with each RabbitMQ pod, you can make it work without petsets. (Setting the hostname inside the pod is trivial, just make sure it matches.) Then you can create a "headless" service for clients to connect to, which matches against all the pods.

If you set it up in HA mode, then in theory you don't need persistent volumes, although RabbitMQ is of course flaky for other reasons unrelated to Kubernetes -- I wouldn't run it if I didn't have existing apps that relies on it.

1 comments

> RabbitMQ is of course flaky for other reasons unrelated to Kubernetes -- I wouldn't run it if I didn't have existing apps that relies on it.

I'm surprised because I know teams which are very satisfied with running RabbitMQ at scale. Could you elaborate?

RabbitMQ doesn't have a good clustering story. The clustering was added after the fact, and it shows. I've written about it on HN several times before, e.g. [1]. Also see Aphyr's Jepsen test of RabbitMQ [2], which demonstrates the problem a bit more rigorously.

With HA mode enabled, it will behave decently during a network partition (which can be caused by non-network-related things: high CPU, for example), but there is no way to safely recover without losing messages. (Note: The frame size issue I mention in that comment has been fixed in one of the latest versions.)

We have also encountered multiple bugs where RabbitMQ will get into a bad state that requires manual recovery. For example, it will suddenly lose all the queue bindings. Or queues will go missing. In several cases the RabbitMQ authors have given me a code snippet to run with the Erlang RELP to fix some internal state table; however, even if you know Erlang, you have to know the deep internals of RabbitMQ in order to think up such a code snippet. There have been a couple of completely unrecoverable incidents where I've simply ended up taking down RabbitMQ, deleted its Mnesia database, and started up a new cluster again. Fortunately, we use RabbitMQ in a way that allows us to do that.

The bugs have been getting fewer over the years, but they're not altogether gone. It's a shame, since RabbitMQ should have a model showcase for Erlang's tremendous support for distribution and fault-tolerance. You're lucky if you've not had any issues with it; personally, I would move away from RabbitMQ in a heartbeat, if we had the resources to rewrite a whole bunch of apps. We've started using NATS for some things where persistence isn't needed, and might look at Kafka for some other applications.

[1] https://news.ycombinator.com/item?id=9448258

[2] https://aphyr.com/posts/315-jepsen-rabbitmq

Thanks a lot for elaborating. This is exactly the kind of insights I wanted to know.