Hacker News new | ask | show | jobs
by pja 5125 days ago
Testing distributed systems is much, much harder than doing so on a monolithic codebase. The number of failure modes goes up very rapidly with the number of nodes in the system & your code has to (in principle) cope with every possible one.
1 comments

true, but it sounds like they (and perhaps you) have never even heard of the chaos monkey.
Randomly killing instances wouldn't have detected this particular failure mode as far as I can see, since the error lay in the inability to resurrect a failed process under certain circumstances.