by lobster_johnson
3008 days ago
RabbitMQ isn't a great study in horizontal scalability and fault tolerance. Its clustering is garbage. The only safe way to run a multinode RabbitMQ setup is to have it stop (the pause_minority setting) when it detects a network partition. Any other mode is lossy by definition. There's no safe HA mode that isn't lossy. (RabbitMQ also wants a bunch of CPU and RAM. On Kubernetes you'll want to dedicate entire nodes to it to avoid problems.)
As an aside, I think the message queue model used by RabbitMQ has far outlived its usefulness. The big problem with this data model is that data disappears: there's no replay and zero visibility into processed data. What should be a database that can always be queried at arbitrary points is instead a conveyor belt that is always moving forward and discarding its history. (And, problematically, NACK is broken with respect to ordering; you can only discard or put back at the end of the queue, not ask to retry.) Kafka and NATS Streaming get this right.
In fact, its clustering on Kubernetes is really well done. It's very easy to get up and running and to use from an operations standpoint.
`pause_minority`: yes, so there's a setting for this. No safe HA mode? Could you explain that further? It does not follow from your previous statements.
I did a load test in the above-mentioned study, and each node in the 5-node cluster hovered around 300 mCPU and 250 MiB at a throughput of 10 MiB/s of messages, running with publisher confirms, consumer auto-ack, and 100 messages in flight. That was 7x our needs, so I left it there for now.
A model in computing does not outlive its usefulness because you say so:
- no replay: not needed; we're contrasting with RPC here which also does not have replay
- zero visibility: this is false as there are lots of metrics and libraries focused on AMQP
- should be a database: no, it should be a networked queue with atomic broadcast in the happy case (unhappy case: see FLP result)
- NACK does not do what you think it does; it puts the message as close to the head of the queue as possible and even putting at the back of the queue like RMQ did around v1 is a valid resolution, because you don't get ordering guarantees nor exactly-once in a distributed queue, generally (Kafka does not give you exactly once [see atomic producers RFC on their Wiki and the consumers are not transactional so they don't consume exactly-once either])
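To make the "no ordering, no exactly-once" point concrete, here is a minimal sketch of a consumer that tolerates redelivery by deduplicating on a message ID. It is a pure-Python simulation, not RabbitMQ client code; `Message`, `message_id`, and `IdempotentConsumer` are hypothetical names, and it assumes the publisher attaches a unique ID to every message.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    message_id: str  # hypothetical unique ID attached by the publisher
    body: str


@dataclass
class IdempotentConsumer:
    """Applies each message at most once, even if it is delivered again."""

    seen: set = field(default_factory=set)
    applied: list = field(default_factory=list)

    def handle(self, msg: Message) -> bool:
        # Returns True if the message was applied, False if it was a duplicate.
        if msg.message_id in self.seen:
            return False
        self.applied.append(msg.body)
        self.seen.add(msg.message_id)
        return True


# Simulated at-least-once delivery: "m2" is redelivered after "m3".
deliveries = [
    Message("m1", "a"),
    Message("m2", "b"),
    Message("m3", "c"),
    Message("m2", "b"),
]
consumer = IdempotentConsumer()
results = [consumer.handle(m) for m in deliveries]
```

In a real service the `seen` set would have to live in durable storage and be updated together with the side effect, otherwise a crash between the two reintroduces duplicates.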
So, I guess someone was wrong on the internet. ;)
Notes:
- You have to use publisher confirms
- You have to use a HA-mode of at least exactly=N/2+1 nodes on every queue
- You have to use acking consumers
- You will get duplicate messages
- You will have to implement retries
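A rough sketch of two of these notes, under stated assumptions: the N/2+1 arithmetic for an `exactly` mirroring policy, and a retry loop around a publish that may go unconfirmed. `ha-mode` and `ha-params` are the classic mirrored queue policy keys; `majority`, `ha_policy`, and `publish_with_retry` are hypothetical helpers, and `publish` stands in for whatever client call reports whether the broker confirmed the message.

```python
import time


def majority(cluster_size: int) -> int:
    """Smallest node count that is a strict majority of the cluster (N/2 + 1)."""
    return cluster_size // 2 + 1


def ha_policy(cluster_size: int) -> dict:
    # Policy definition mirroring each queue to a majority of nodes.
    return {"ha-mode": "exactly", "ha-params": majority(cluster_size)}


def publish_with_retry(publish, body, attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call publish(body) until it returns True (confirmed) or attempts run out.

    Returns the number of attempts used. Because an unconfirmed publish may
    still have reached the broker, retrying can produce duplicates.
    """
    for attempt in range(attempts):
        if publish(body):
            return attempt + 1
        sleep(base_delay * 2 ** attempt)  # exponential backoff between tries
    raise RuntimeError(f"publish not confirmed after {attempts} attempts")
```

For the 5-node cluster mentioned above, `ha_policy(5)` gives `{"ha-mode": "exactly", "ha-params": 3}`. The retry loop is also where the duplicate messages come from, which is why the consumer side has to be idempotent.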