HN Mirror


by theevilsharpie 1710 days ago

> If you read past post mortem, you should notice that configuration induced outages have been the sole category of all large-scale outages.

Is it really that surprising? GCP's services are designed to be fault tolerant, and can easily deal with node and equipment failures.

Bugs and configuration errors are much more difficult to deal with, because the computer is doing what it's been programmed to do, even if that isn't necessarily what the developers wanted or intended. Correctness-checking tools can catch trivial configuration errors, but problems can still slip through, especially if they only manifest themselves under a production load.
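To make that concrete, here is a minimal sketch of the kind of check such a tool might run. The `max_connections` key and the `validate_config` helper are hypothetical, invented for illustration, and have nothing to do with GCP's actual tooling.

```python
def validate_config(config):
    """Return a list of problems that a static check can detect."""
    problems = []
    max_conn = config.get("max_connections")  # hypothetical setting
    # bool is a subclass of int in Python, so reject it explicitly.
    if not isinstance(max_conn, int) or isinstance(max_conn, bool):
        problems.append("max_connections must be an integer")
    elif max_conn <= 0:
        problems.append("max_connections must be positive")
    return problems


# A trivial error: the value is a string, so the check flags it.
typo_config = {"max_connections": "1000"}

# Well-formed, so the check passes, yet it may still be far too
# small for real traffic. Nothing here can know that.
valid_but_too_small = {"max_connections": 10}

print(validate_config(typo_config))
print(validate_config(valid_but_too_small))
```

The second config is the interesting one: it is syntactically and semantically valid, so a checker like this has nothing to complain about, and the problem would only show up once production load arrives.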

If GCP were repeating literally the same failure over and over again, I could understand the frustration, but I don't think that's the case here. Demanding that GCP avoid all configuration-related outages seems unreasonable -- they would either have to stop any further development (since after all, any change has the potential to cause an outage), or they'd need some type of mechanism for the computer to do what the developers meant rather than what they said, which is well beyond any current or foreseeable technology and would essentially require a Star Trek-level sentient computer.


justicezyx 1710 days ago

I told you they are not improving. Not that config-induced outages aren't nasty...


extropy 1710 days ago

It might be a business decision.

More reliability means slower development speed. If you are in the same ballpark as your competition, it is better to invest in development speed than in being 10x more reliable.


bostik 1710 days ago

And perhaps counter-intuitively: slower development speed often means reduced reliability.

If your development and deploy cadence is slower, you end up batching up more changes in any given deployment. Larger changes => higher likelihood of something in them being wrong => harder to debug due to delta size => wider effective blast radius.
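A rough back-of-the-envelope version of the first arrow, assuming each change independently carries the same small chance of being bad. The 1% figure and the independence assumption are made up for illustration, and `batch_failure_probability` is a hypothetical helper:

```python
def batch_failure_probability(per_change_defect_rate, changes_per_deploy):
    """Chance that at least one change in a deploy is defective,
    assuming defects are independent and equally likely per change."""
    return 1.0 - (1.0 - per_change_defect_rate) ** changes_per_deploy


# Assumed 1% defect rate per change, for a few batch sizes.
for batch_size in (1, 10, 50):
    risk = batch_failure_probability(0.01, batch_size)
    print(f"{batch_size:>2} changes per deploy: {risk:.1%} chance of a bad deploy")
```

Under those assumptions the risk per deploy climbs quickly with batch size, and when a large deploy does go wrong there are that many more changes to sift through to find the culprit.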

Fast testing and robust build validation are some of the more important guard rails that allow you to move fast and be more reliable at the same time.
