

by teraflop 2365 days ago

Yes, this applies to any distributed system.

It's a basic truth about computer networks that if you send a message to a machine and don't hear back, you don't know whether it's actually dead or just unable to communicate with you. If a machine finds itself in that situation, one option is to simply wait until the peer eventually comes back to life, or to keep retrying until a message eventually gets through.
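To make the ambiguity concrete, here is a minimal sketch in Python. It is a toy simulation rather than real networking code, and the names (`Scenario`, `send_and_wait`, `outcomes`) are made up for illustration. The point is that three different failures look identical to the sender, even though in one of them the peer actually processed the request.

```python
from enum import Enum


class Scenario(Enum):
    # Hypothetical failure modes, chosen for illustration only.
    PEER_CRASHED = "peer crashed before handling the request"
    REQUEST_LOST = "network dropped the request"
    REPLY_LOST = "peer handled the request, network dropped the reply"
    HEALTHY = "request and reply both delivered"


def send_and_wait(scenario):
    """Simulate one request.

    Returns (sender_observation, peer_applied_request). The sender only
    ever sees the first element; the second is hidden from it.
    """
    peer_alive = scenario is not Scenario.PEER_CRASHED
    request_delivered = peer_alive and scenario is not Scenario.REQUEST_LOST
    peer_applied = request_delivered
    reply_delivered = peer_applied and scenario is not Scenario.REPLY_LOST
    observation = "reply" if reply_delivered else "timeout"
    return observation, peer_applied


def outcomes():
    """Run every scenario and collect what happened in each."""
    return {s.name: send_and_wait(s) for s in Scenario}


if __name__ == "__main__":
    for name, (seen, applied) in outcomes().items():
        print(f"{name:13} sender sees: {seen:8} peer applied request: {applied}")
```

In all three failure cases the sender sees nothing but a timeout, so whatever it does next has to be correct whether or not the request was applied.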

If it takes any other action, it has to make sure that action is "safe" (w.r.t. whatever guarantees it's trying to provide) under either scenario. That's all "partition tolerance" really is: a statement that you, as the developer, have a sort of burden of proof to make sure you've considered all the possible failure scenarios. (If that seems like a tautology, well, that's why we usually don't bother talking about the "P" in the CAP theorem.)

As your comment alludes to, one can often simplify the problem a bit by reducing the number of possible scenarios, for example by assuming that crashed nodes never recover and are only ever replaced by new nodes. But that still doesn't make the problem go away. Unless your network (including both the hardware and the OS network stack!) is infallible, you can't reliably know whether a remote machine has crashed in the first place.

Consider the common situation where you want to provide some kind of transactional guarantees; for instance, transactions should appear to complete in causal order, and their effects should not disappear once committed. That implies that even if a node "looks" dead, it's not safe for another one to take over its role unless it's really, truly dead; otherwise you risk returning stale (transactionally invalid) results.
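Here is a rough sketch of how that goes wrong, again as a toy in-memory simulation in Python. The `Replica` class and `simulate_unsafe_takeover` function are hypothetical names, not any real system's API. Node A only looks dead from the other side of a partition, B is promoted anyway, and a client that can still reach A reads a value that has already been overwritten by a committed write on B.

```python
class Replica:
    """A toy key-value node that serves requests only while it believes it is primary."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.is_primary = False

    def write(self, key, value):
        if not self.is_primary:
            raise RuntimeError(f"{self.name} is not primary")
        self.data[key] = value

    def read(self, key):
        if not self.is_primary:
            raise RuntimeError(f"{self.name} is not primary")
        return self.data.get(key)


def simulate_unsafe_takeover():
    a, b = Replica("A"), Replica("B")

    # Normal operation: A is primary and B is a fully caught-up copy.
    a.is_primary = True
    a.write("balance", 100)
    b.data = dict(a.data)

    # A partition hides A from whoever decides on failover. A "looks" dead,
    # but it is still running and still reachable by some clients.
    b.is_primary = True        # unsafe: promotion based only on a timeout
    b.write("balance", 50)     # a new transaction commits on B

    # A was never told it lost its role, so it happily answers reads.
    stale = a.read("balance")
    fresh = b.read("balance")
    return stale, fresh


if __name__ == "__main__":
    stale, fresh = simulate_unsafe_takeover()
    print(f"client talking to A reads {stale}, client talking to B reads {fresh}")
```

The client on A's side sees a balance that no longer reflects the committed state, which is exactly the stale result the guarantee was supposed to rule out.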