by theevilsharpie
1663 days ago
> If you read past post mortem, you should notice that configuration induced outages have been the sole category of all large-scale outages.

Is it really that surprising?

GCP's services are designed to be fault tolerant, and can easily deal with node and equipment failures. Bugs and configuration errors are much more difficult to deal with, because the computer is doing exactly what it's been programmed to do, even if that isn't what the developers wanted or intended. Correctness-checking tools can catch trivial configuration errors, but problems can still slip through, especially if they only manifest under a production load.

If GCP were repeating literally the same failure over and over again, I could understand the frustration, but I don't think that's the case here. Demanding that GCP avoid all configuration-related outages seems unreasonable -- they would either have to stop any further development (since, after all, any change has the potential to cause an outage), or they'd need some mechanism for the computer to do what the developers meant rather than what they said, which is well beyond any current or foreseeable technology and would essentially require a Star Trek-level sentient computer.