Testing distributed systems is much, much harder than doing so on a monolithic codebase. The number of failure modes goes up very rapidly with the number of nodes in the system & your code has to (in principle) cope with every possible one.
Randomly killing instances wouldn't have detected this particular failure mode as far as I can see, since the error lay in the inability to resurrect a failed process under certain circumstances.