by jhayward 2656 days ago
Most failures of this type end up being a cascading resource exhaustion problem propagated by an un- or mis-analyzed feedback or dependency path. It is frankly amazing it doesn't happen more often.

I'm excluding the other common type of long outage, the head-desking "failover didn't work, backups are horked, it'll take tens of hours to restore/cold start" kind.