Hacker News new | ask | show | jobs
by shvedsky 1806 days ago
There is an inherent speed / reliability tradeoff that is extremely difficult to solve inside one message bus. When you get to truly large systems with a lot of nines of reliability, it starts to make sense to use two systems:

1. Fast system that delivers messages very quickly but is not always partition-tolerant or available 2. Slower, partition tolerant system with high availability but also higher latency (i.e. a database)

The author goes through this in the very first section. Webhook events will eventually start getting lost often enough for the developer to think about a backup mechanism.

Long-polling works if you have a lot of memory on your database frontend. Most shared databases want none of your long-running requests to occupy their memory which is better used for caches.

Even if your message bus has the ability to store and re-deliver events, you might want to limit this ability (by assigning a low TTL). Consider that the consumer microservice enters and recovers from an outage. In the meantime, the producer's events will accummulate in the message service. At the same time, the consumer often doesn't need to consume each individual event but rather some "end state" of some entity or a document. If all lost events were to get re-delivered, the consumers wouldn't be able to handle them, and would enter an outage again. This is where deliberately decreasing the reliability of the message bus and rely on polling would automatically recover the service.

There are other reasons, of course. The author is absolutely correct in their statement, though: whenever a system is implemented using hooks / messages, its developers always end up supplementing it with polling.