HN Mirror



	by andrewcooke 5125 days ago
	weird. they had a problem caused by a series of bugs, yet the word "test" doesn't appear anywhere on that page.


pja 5125 days ago

Testing a distributed system is much, much harder than testing a monolithic codebase. The number of failure modes goes up very rapidly with the number of nodes in the system, and your code has to (in principle) cope with every possible one.
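
To put rough numbers on that growth, here is a minimal sketch under a deliberately crude model that is not from the thread: each node is either up or down, and each pairwise link is either healthy or broken. The function names are made up for illustration, and real systems have many more failure modes (slow nodes, partial writes, clock skew), so treat these counts as a lower bound.

```python
def node_failure_states(nodes: int) -> int:
    """Up/down combinations in which at least one node is down."""
    return 2 ** nodes - 1


def link_failure_states(nodes: int) -> int:
    """Healthy/broken combinations of pairwise links with at least one broken."""
    links = nodes * (nodes - 1) // 2
    return 2 ** links - 1


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(n, node_failure_states(n), link_failure_states(n))
```

Even in this toy model the link count grows quadratically and the number of link states grows exponentially in that, which is why exhaustive testing stops being an option almost immediately.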


andrewcooke 5124 days ago

true, but it sounds like they (and perhaps you) have never even heard of Chaos Monkey.


pja 5124 days ago

Randomly killing instances wouldn't have detected this particular failure mode as far as I can see, since the error lay in the inability to resurrect a failed process under certain circumstances.
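
A toy sketch of why that can happen, with everything in it assumed for illustration: the thread does not say what the "certain circumstances" were, so the `stale_lock` flag below is a hypothetical stand-in for them, and `Instance`, `kill`, `restart` and `kill_and_recover` are invented names rather than any real tool's API.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    running: bool = True
    stale_lock: bool = False  # hypothetical condition that blocks a restart


def kill(inst: Instance, leave_stale_lock: bool = False) -> None:
    """Stop the instance, optionally leaving the hypothetical bad state behind."""
    inst.running = False
    inst.stale_lock = leave_stale_lock


def restart(inst: Instance) -> bool:
    """Hypothetical supervisor: refuses to restart while the stale lock exists."""
    if inst.stale_lock:
        return False
    inst.running = True
    return True


def kill_and_recover(leave_stale_lock: bool) -> bool:
    """One kill-test round: kill a fresh instance, then try to bring it back."""
    inst = Instance()
    kill(inst, leave_stale_lock)
    return restart(inst)


if __name__ == "__main__":
    print("plain kill recovers:", kill_and_recover(False))
    print("kill under bad state recovers:", kill_and_recover(True))
```

In this model a kill test that only ever produces the plain case passes every round; the restart bug shows up only if the test also reproduces the specific state that prevents recovery.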
