by teyc 5650 days ago
There was a major cascading failure in the power grid a few years back.

I think there was also an Amazon outage attributed to the same class of error.

The engineering trade-offs are: 1) protecting the servers themselves from damage; 2) accepting that when servers go offline to protect themselves, the load they shed may push other servers offline too; 3) isolating the failure to specific subgroups of the network; 4) providing enough excess capacity to absorb the load in the event of an outage.
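
To make points 2 and 4 concrete, here is a rough toy model, not a description of any real grid or data center. It assumes the total load is spread evenly over whichever servers are still online, and that a server trips offline as soon as its share exceeds its capacity. The function name and parameters are made up for illustration.

```python
def simulate_cascade(total_load, capacities, failed=()):
    """Return the set of server indices that are offline once things settle.

    Toy assumptions: load is split evenly across online servers, and a
    server protects itself by going offline when its share exceeds its
    capacity.
    """
    offline = set(failed)
    while True:
        online = [i for i in range(len(capacities)) if i not in offline]
        if not online:
            return offline  # total outage
        share = total_load / len(online)
        tripped = {i for i in online if share > capacities[i]}
        if not tripped:
            return offline  # the survivors can carry the load
        offline |= tripped  # more servers drop out, so redistribute again


# Four servers with capacity 100 each, and one initial failure.
# At 70% utilization the remaining three absorb the load.
contained = simulate_cascade(280, [100] * 4, failed=[0])
# At 80% utilization the same single failure overloads every survivor.
cascaded = simulate_cascade(320, [100] * 4, failed=[0])
```

With these numbers the difference between a contained failure and a total outage is just the headroom: three survivors can carry 280 units but not 320.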

Bugs will occur, no matter how good the engineering is. Clients will need to be smarter: for example, by implementing some kind of exponential backoff depending on whether the network is responsive or not.
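
A minimal sketch of what client-side exponential backoff could look like, assuming the client treats any `OSError` (which covers `ConnectionError` and `TimeoutError` in Python 3) as "network not responsive". The function name, defaults, and the random jitter are my own illustrative choices, not taken from any particular client library.

```python
import random
import time


def call_with_backoff(operation, base=0.5, cap=30.0, max_tries=6,
                      sleep=time.sleep, jitter=random.random):
    """Call operation() and retry on network-style errors.

    The wait limit doubles after each failure (base, 2*base, 4*base, ...)
    up to `cap`. Each actual wait is a random fraction of that limit, so
    many clients recovering at once do not all retry at the same instant.
    """
    for attempt in range(max_tries):
        try:
            return operation()
        except OSError:
            if attempt == max_tries - 1:
                raise  # out of retries, let the caller see the error
            limit = min(cap, base * 2 ** attempt)
            sleep(limit * jitter())
```

Usage would be something like `call_with_backoff(lambda: fetch(url))`, where `fetch` stands in for whatever request function the client already has. The `sleep` and `jitter` parameters are only there so the waiting can be swapped out in tests.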