by lobster_johnson 3008 days ago
RabbitMQ isn't a great study in horizontal scalability and fault tolerance. Its clustering is garbage.

The only safe way to run a multinode RabbitMQ setup is to have it stop (the pause_minority setting) when it detects a network partition. Any other mode is lossy by definition. There's no safe HA mode that isn't lossy.
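To make the pause_minority rule concrete, here is a toy majority check in Python. It is a sketch and not RabbitMQ's code: the function name `should_pause` and its parameters are made up for illustration, and treating an exact half as a minority is my reading of the RabbitMQ partition-handling docs (the setting lives under the `cluster_partition_handling` config key).

```python
def should_pause(visible_nodes: int, cluster_size: int) -> bool:
    """Toy model of the pause_minority decision (hypothetical helper).

    visible_nodes counts this node plus every peer it can still reach.
    Assumption: a node pauses unless it sees strictly more than half
    of the cluster, so an even split pauses both sides.
    """
    if not 1 <= visible_nodes <= cluster_size:
        raise ValueError("visible_nodes must be between 1 and cluster_size")
    return 2 * visible_nodes <= cluster_size


if __name__ == "__main__":
    # A 5-node cluster partitioned into a 3-node side and a 2-node side.
    for side in (3, 2):
        print(f"side of {side} nodes pauses: {should_pause(side, 5)}")
```

Under this model, a 5-node cluster split 3/2 keeps the 3-node side running and pauses the 2-node side, while a 4-node cluster split 2/2 pauses everything, which is one reason odd cluster sizes are usually preferred.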

(RabbitMQ also wants a bunch of CPU and RAM. On Kubernetes you'll want to dedicate entire nodes to it to avoid problems.)

As an aside, I think the message queue model used by RabbitMQ has far outlived its usefulness. The big problem with this data model is that data disappears: there's no replay and zero visibility into processed data. What should be a database that can always be queried at arbitrary points is instead a conveyor belt that is always moving forward and discarding its history. (And, problematically, NACK is broken with respect to ordering; you can only discard or put back at the end of the queue, not ask to retry.) Kafka and NATS Streaming get this right.

1 comments

I used RMQ because I just finished a smaller study of its properties when running it on Kube.

In fact, its clustering on Kubernetes is really well done. It's very easy to get up and running and to use from an operations standpoint.

`pause_minority` yes, so there's a setting for this. No safe HA mode? Could you explain that further? It is not entailed by your previous statements.

I did a load test in the above-mentioned study, and each node in the 5-node cluster hovered around 300 mCPU and 250 MiB of memory at a throughput of 10 MiB/s of messages, with publisher confirms, consumer auto-ack, and 100 in flight. That was 7x our needs, so I left it there for now.

A model in computing does not outlive its usefulness because you say so:

- no replay: not needed; we're contrasting with RPC here, which also does not have replay
- zero visibility: this is false, as there are lots of metrics and libraries focused on AMQP
- should be a database: no, it should be a networked queue with atomic broadcast in the happy case (unhappy case: see the FLP result)
- NACK does not do what you think it does; it puts the message as close to the head of the queue as possible. Even putting it at the back of the queue, like RMQ did around v1, is a valid resolution, because you generally get neither ordering guarantees nor exactly-once in a distributed queue. (Kafka does not give you exactly-once either: see the atomic producers RFC on their wiki, and the consumers are not transactional, so they don't consume exactly-once.)
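The NACK point can be illustrated with a toy in-memory simulation. This is not how the broker is implemented; `simulate`, its parameters, and the two policy names are invented for the sketch. It only shows that once more than one message is in flight, neither requeue position gives you the original order back.

```python
from collections import deque


def simulate(policy, messages=(1, 2, 3, 4), prefetch=2, fail_once=frozenset({1})):
    """Return the order in which messages are successfully processed.

    policy: "head" requeues a rejected message at the front of the queue,
            "tail" requeues it at the back (both are toy policies).
    fail_once: messages whose first delivery is rejected (nack + requeue).
    """
    if policy not in ("head", "tail"):
        raise ValueError("policy must be 'head' or 'tail'")
    queue = deque(messages)
    already_failed = set()
    processed = []
    while queue:
        # Deliver up to `prefetch` messages; these are now in flight.
        in_flight = [queue.popleft() for _ in range(min(prefetch, len(queue)))]
        requeued = []
        for msg in in_flight:
            if msg in fail_once and msg not in already_failed:
                already_failed.add(msg)
                requeued.append(msg)   # consumer nacks with requeue
            else:
                processed.append(msg)  # consumer acks
        if policy == "head":
            # Put rejected messages back at the front, keeping their order.
            queue.extendleft(reversed(requeued))
        else:
            queue.extend(requeued)
    return processed


if __name__ == "__main__":
    print("head:", simulate("head"))
    print("tail:", simulate("tail"))
```

With the defaults, message 2 is already in flight when message 1 is rejected, so 2 is processed before 1 under both policies; the requeue position only changes how far back 1 ends up.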

So, I guess someone was wrong on the internet. ;)

Notes:

- You have to use publisher confirms
- You have to use a HA-mode of at least exactly=N/2+1 nodes on every queue
- You have to use acking consumers
- You will get duplicate messages
- You will have to implement retries
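A rough Python sketch of what these notes imply for application code follows. It deliberately avoids any client library, so everything here is a stand-in: `publish` is a hypothetical callable that returns True once the broker has confirmed the message, `message_id` is assumed to be set by the publisher, and the in-memory `seen` set would have to be a durable store in practice. The `ha-mode` and `ha-params` keys are the ones used by classic mirrored-queue policies.

```python
import time
from typing import Callable, Dict, Set, Union


def mirror_count(cluster_size: int) -> int:
    """Smallest majority of an N-node cluster: floor(N/2) + 1."""
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    return cluster_size // 2 + 1


def ha_policy(cluster_size: int) -> Dict[str, Union[str, int]]:
    """Policy definition mirroring each queue to a majority of nodes."""
    return {"ha-mode": "exactly", "ha-params": mirror_count(cluster_size)}


def publish_with_retry(publish: Callable[[bytes], bool], body: bytes,
                       attempts: int = 3, base_delay: float = 0.0) -> int:
    """Retry until `publish` reports a confirm; return the attempt number.

    `publish` is a hypothetical stand-in for "send and wait for the
    publisher confirm". Retrying can create duplicates, which is why the
    consumer below has to deduplicate.
    """
    for attempt in range(1, attempts + 1):
        if publish(body):
            return attempt
        if attempt < attempts:
            time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
    raise RuntimeError(f"message not confirmed after {attempts} attempts")


class DedupingConsumer:
    """Acking consumer that runs the handler at most once per message id."""

    def __init__(self, handler: Callable[[bytes], None]) -> None:
        self.handler = handler
        self.seen: Set[str] = set()  # assumption: durable storage in real use

    def on_message(self, message_id: str, body: bytes) -> bool:
        """Return True to ack, False to nack so the broker redelivers."""
        if message_id in self.seen:
            return True  # duplicate delivery: ack without reprocessing
        try:
            self.handler(body)
        except Exception:
            return False  # leave unacked; the message will come back
        self.seen.add(message_id)
        return True
```

For the 5-node cluster mentioned above, `ha_policy(5)` gives `{"ha-mode": "exactly", "ha-params": 3}`. The dedup step only makes processing idempotent per message id; it does not turn at-least-once delivery into exactly-once.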