I don't agree with his fundamental premise:

> Network Partitions are Rare, Server Failures are Not

Network partitions happen all the time. Sure, the whole "a switch failed and that piece of the network isn't there anymore" scenario doesn't happen a lot, but what does happen a lot is a slow or delayed connection, or a machine going offline for a few seconds.
Even VMs on more statically allocated clouds like DigitalOcean and AWS will experience small, constant blips that affect your whole stack.
What annoys me in particular is that these blips affect everything. Every app needs to fail gracefully, be it a PostgreSQL client connection, a Memcached lookup or an S3 API call. The fact that such catch-and-retry boilerplate logic needs to be built into the application layer, and every layer within it, is still something I find rather insane. It leaks into the application logic in often rather insidious ways, or in ways that pollute your code with defenses. Everything has to be idempotent, which is easy enough for transactional database stuff, less easy for things like asynchronous queues that fire off emails. Erlang has already provided a solution to the problem, but I suspect we need OS-level support to avoid reinventing the wheel in every language and platform. /rant
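
To make the boilerplate concrete, here is a rough Python sketch of the kind of thing I mean: retry with exponential backoff, plus an idempotency key so a retried email send can't go out twice. All the names here (`TransientError`, `retry`, `EmailQueue`) are made up for illustration and don't come from any real library; a real client would catch its driver's own timeout and connection errors instead.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical stand-in for a timeout or dropped connection."""


def retry(op, attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call op() until it succeeds, backing off exponentially with jitter."""
    for attempt in range(attempts):
        try:
            return op()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # Full jitter: wait a random time up to base_delay * 2^attempt.
            sleep(random.uniform(0, base_delay * 2 ** attempt))


class EmailQueue:
    """Toy in-memory queue that drops duplicate sends via an idempotency key."""

    def __init__(self):
        self.sent = {}  # idempotency key -> message

    def send(self, key, message):
        # A repeated call with the same key is a no-op, so retrying after a
        # lost acknowledgement cannot send the same email twice.
        if key not in self.sent:
            self.sent[key] = message
        return key
```

The caller would wrap every remote call as something like `retry(lambda: queue.send("order-42-confirmation", body))`, and that wrapper ends up repeated for the database, the cache and the object store alike, which is exactly the complaint.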