by leajkinUnk
1928 days ago
“Big cloud” has had fires take out clusters, and somehow they manage to keep it out of the news. In spite of the redundancy and failover procedures, keeping your data centers running when one of the clusters was recently *on fire* is often only possible through heroic efforts. When I say “heroic efforts”, that’s in contrast to “ordinary error recovery and failover”, which is the way you’d want to handle a DC fire, because DC fires happen often enough.

The thing is, while these big companies have a much larger base of expertise to draw on and simply more staff time to throw at problems, there are factors which incentivize their employees to *increase risk* rather than reduce it. These big companies put pressure on all their engineers to figure out ways to drive down costs. So, while a big cloud provider won’t make a rookie mistake (they won’t forget to run disaster recovery drills, they won’t forget to make backups and run test restores), they *will* do a bunch of calculations to figure out how close to disaster they can run in order to save money. The real disaster will then reveal some false, hidden assumption in their error recovery models.

In other words, the big companies solve all the easy problems and then create new, hard problems.