Hacker News new | ask | show | jobs
by xyz-x 3009 days ago
For simplicity, let's take RMQ

> Multiplying the load on message brokers, often 2-5x, requiring extensive and costly scale out of the messaging system

Scaling it out is no problem. kubectl apply -f statefulset.yaml replicas=5 — done.

Also, message brokers almost never come under load because they are primarily IO bound, not CPU bound. If your broker comes under heavy CPU usage, you're almost certainly doing something wrong.

> No easy flow-control, fast services can produce and consume messages quickly enough to cause side effects for the slow services that are the bottlenecks and dictate overall system performance. This can cause large swings in performance as queues fill up.

In a RPC based architecture flow control would be even worse, cascade up to the caller and cause timeouts and unbounded ephemeral queues that end up looking like a hung system. Worse even, the thundering herd problem will make you unable to get back up.

> Increased latency for end-to-end processing because of repeated round trips to the messaging broker, which must persist messages to disk each time

No, RMQ does not necessarily persist the message each time; you can get a publisher confirm back if all receivers have claimed the message.

Furthermore, while send-then-ACK latency can be up to the disk latency, you're by no means limited to sending one message at a time.

> Increased difficulty isolating performance issues to an individual service vs the messaging brokers vs side-effects from other services impacting the messaging brokers

You almost never have performance issues affecting the brokers.

Suppose you do; then it's most often because you've let your queues grow too long. Where would you be in a HTTP world? You'd be having an outage.

---

> HTTP-based services (REST, gRPC, etc) offer low latency and high throughput compared to queues. They also have strictly lower resource requirements (network, disk, compute, memory) than a message queue solution.

No, not strictly lower; I can imagine a scenario where you have >80% utilisation on your HTTP server on averge; now you'll have growing ephemeral TCP request queues once in a while; then you have to bump the node stats. Consider the alternative; a node that has no latency requirement; it can be utilised to a higher degree; 90-100%. Always at 100%? Then, scale out based on queue size, then scale down again.

1 comments

RabbitMQ isn't a great study in horizontal scalability and fault tolerance. Its clustering is garbage.

The only safe way to run a multinode RabbitMQ setup is to have it stop (the pause_minority setting) when it detects a network partition. Any other mode is lossy by definition. There's no safe HA mode that isn't lossy.

(RabbitMQ also wants a bunch of CPU and RAM. On Kubernetes you'll want to dedicate entire nodes to it to avoid problems.)

As an aside, I think the message queue model used by RabbitMQ has far outlived its usefulness. The big problem with this data model is that data disappears; there's no replay, and zero visibility into processed data; what should be a database that can always be queried at arbitrary points is instead of a conveyor belt that is always moving forward and discarding its history. (And, problematically, NACK is broken with respect to ordering; you can only discard or put back at the end of the queue, not ask to retry.) Kafka and NATS Streaming get this right.

I used RMQ because I just finished a smaller study of its properties when running it on Kube.

In fact, its clustering on Kubernetes is really well done. It's very easy to get up and running and to use from an operations standpoint.

`pause_minority` yes, so there's a setting for this. No safe HA mode? Could you explain that further? It is not entailed by your previous statements.

I did a load test in the above mentioned study and each node in the 5-node cluster hovered around 300 mCPU and 250 MiB with a throughput of 10 MiB/s of messages in publisher-confirms + consumer auto-ack + 100 inflight —mode. That was 7x our needs so I left it there for now.

A model in computing does not outlive its usefulness because you say so:

- no replay: not needed; we're contrasting with RPC here which also does not have replay - zero visibility: this is false as there are lots of metrics and libraries focused on AMQP - should be a database: no, it should be a networked queue with atomic broadcast in the happy case (unhappy case: see FLP result) - NACK does not do what you think it does; it puts the message as close to the head of the queue as possible and even putting at the back of the queue like RMQ did around v1 is a valid resolution, because you don't get ordering guarantees nor exactly-once in a distributed queue, generally (Kafka does not give you exactly once [see atomic producers RFC on their Wiki and the consumers are not transactional so they don't consume exactly-once either])

So, I guess someone was wrong on the internet. ;)

Notes:

- You have to use publisher confirms - You have to use a HA-mode of at least exactly=N/2+1 nodes on every queue - You have to use acking consumers - You will get duplicate messages - You will have to implement retries