HN Mirror

by leajkinUnk 1928 days ago

“Big cloud” has had fires take out clusters, and somehow they manage to keep it out of the news. In spite of the redundancy and failover procedures, keeping your data centers running when one of the clusters was recently *on fire* is often only possible through heroic efforts.

I say “heroic efforts” in contrast to “ordinary error recovery and failover”, which is how you’d want to handle a DC fire, because DC fires happen often enough.

The thing is, while these big companies have a much larger base of expertise to draw on and simply more staff time to throw at problems, there are factors which incentivize these employees to *increase risk* rather than reduce it.

These big companies put pressure on all their engineers to figure out ways to drive down costs. So, while a big cloud provider won’t make a rookie mistake—they won’t forget to run disaster recovery drills, they won’t forget to make backups and run test restores—they *will* do a bunch of calculations to figure out how close to disaster they can run in order to save money. The real disaster will then reveal some false, hidden assumption in their error recovery models.

Or in other words, the big companies solve all the easy problems and then create new, hard problems.

2 comments

exikyut 1928 days ago

I'm curious what references or leads I might follow to learn more about these fires and other events you mention.

leajkinUnk 1928 days ago

Get a job working at these companies and go out for drinks with the old-timers.

pm90 1928 days ago

You know, those are excellent observations. But they don’t change the decision calculus in this case. Using bigger cloud providers doesn’t eliminate all risk; it just creates a different kind of risk.

What we call “progress” is just putting our best efforts into reducing or eliminating the problems we know how to solve, without realizing what new problems those solutions may create further down the line. The only way to know for sure is to try it, see how it goes, and then re-evaluate later.

California had a problem with frequent forest fires, so they put out every fire. Turns out that solution created a bigger problem down the line: humongous, uncontrollable fires that would not have happened if the smaller fires had not been put out so frequently. Oops.
