| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by zbentley 2221 days ago

I'm glad Kafka is working for you you! Rabbit's HA story has definitely been rough until recently. But I think a few of the issues you describe can be mitigated with a bit better understanding of what's going on.

> Any time any of the consumers would slow down of fail, rabbit would run out of memory and crash, causing sitewide failure.

Not to be glib, but in any brokered system, you to have enough (memory and disk) buffer space to soak up capacity when consumers slow down, within reason. Older (2.x) RabbitMQs did a very poor job rapidly paging queue contents to disk when under memory pressure. Newer versions do better, but you can still run the broker out of memory with a high enough ingress/low enough egress, which brings me to...

It sounds like you did not set your high watermarks correctly (another commenter already pointed this out); RabbitMQ can be configured to reject incoming traffic when over a memory watermark, rather than crash.

However, a couple of things can complicate this: rejection of incoming publishes on already-established connections may not make it back to your clients, if they are poorly behaved (and a lot of AMQP client libraries are poorly behaved) or are not using publisher confirms. Additionally, if your clients do notice that this is happening and continually reattempt to reconnect to RabbitMQ to handle the (actually backpressure due to memory) rejection notification, this connection churn can put massive amounts of strain on the broker, causing it to slow down or hang. In RabbitMQ's defense, connect/disconnect storms will damage many/most other databases as well.

> We experimented with ... "durability", which caused the cluster to crash more frequently and lose more messages

A few things to be aware of regarding durability:

Before RabbitMQ 3-point-something (I want to say 3.2), some poorly chosen Erlang IO-threadpool tunings caused durability to have higher latency than expected with large workloads. Anecdotally, the upgrade from 3.6 to 3.7 also improved performance of disk-persisted workloads.

If you have durability enabled, you should really be using publisher confirms (https://www.rabbitmq.com/confirms.html) as well. This isn't just for assurance that your messages made it; without confirms on, I've seen situations where publishers seem to get "ahead" of Rabbit's ability to persist and enqueue messages internally, causing server hiccups, hangs, and message loss. That's all anecdotal, of course, but I've seen this occur on a lot of different clusters. Pub confirms are a species of backpressure, basically--not from consumers to producers, but from RabbitMQ itself to producers.

When moving a high volume of non-tiny messages (where tiny is <500b), you really need a fast disk. That means the equivalent of NVMe/a write-cache-backed RAID (if on real hardware; ask me about battery relearns killing RabbitMQ sometime ... that was a bad night like the one you described), or paying attention to max-throughput/IOPS if deploying in the cloud (for example, a small EBS gp2 volume may not bring enough throughput, and sometimes you may need to RAID-0 up a pair of sufficiently-sized gp2's to get what you need). And no burst IOPS, ever.

> We experimented with "high availability" mode

You're 100% right about this. RabbitMQ's story in this area was pretty bad until recently. Quorum queues and lots of internal improvements have made the last ~4 years worth of the Rabbit versions behave better in HA configurations. But things can still get really dicey. Always "pause minority" (trade away your uptime for message loss), as the Jepsen article you linked mentioned.

For failure recovery (though it's not that "HA") if you can get single-node durability working well and are using networked disks (e.g. NFS, EBS) or a snapshotted-and-backed-up filesystem, one of the nice things about RabbitMQ's persistence format is that at the instant of crash, all but the very most recent messages are recoverable in the storage layer. That doesn't solve availability, but it does mean you don't have catastrophic data loss when you lose a node (restore a snapshot or reattach the data volume to a replacement server).