by JohnFen
2333 days ago
> if the choice is "our service occasionally goes down" and "we never release new features", you may accept the risk of occasionally going down. I don't think that I would, because I don't accept the premise of that being the necessary choice. It's just the choice that the providers deign to offer for economic reasons. But my objection isn't that there should be zero downtime. My objection is the idea that a service provider considers any downtime to be acceptable.

If you don't view any downtime as acceptable, the logical thing to do is invest all of your resources in reducing downtime. That means investing solely in reliability infrastructure and redundancy, and making few or no changes to the system, since change introduces failure.

Since no service does that, the logical conclusion is that very few people actually consider any downtime unacceptable. Broadly speaking, I can think of literally no service that advertises "zero downtime". Cold storage gets close, but even those providers offer a measly 12 or 16 nines of reliability.

In other words, reliability is a business goal like any other. Trying to achieve "perfect" reliability with limited resources isn't a good time. That's why error budgets empower SREs: you can go to leaders and say, "Hey, we're exceeding our error budget, so we're not making any more changes and are only working on reliability until we're back within our agreed reliability target."
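To make the error-budget idea concrete, here's a minimal sketch of the arithmetic. It assumes a hypothetical 99.9% availability SLO over a 30-day window, with the budget counted in minutes of downtime. Real SLOs are often defined as a ratio of successful requests instead, and the `ErrorBudget` class and its method names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical helper: downtime-based error budget for one SLO window."""

    slo: float             # target availability, e.g. 0.999 for "three nines"
    window_minutes: float  # length of the SLO window in minutes

    @property
    def allowed_downtime(self) -> float:
        # Minutes of downtime the SLO tolerates over the whole window.
        return (1.0 - self.slo) * self.window_minutes

    def remaining(self, observed_downtime: float) -> float:
        # Budget left; negative once the budget has been exceeded.
        return self.allowed_downtime - observed_downtime

    def should_freeze_changes(self, observed_downtime: float) -> bool:
        # Budget exceeded: stop shipping changes and work on reliability.
        return observed_downtime > self.allowed_downtime


# Assumed example: 99.9% over 30 days allows about 43.2 minutes of downtime.
budget = ErrorBudget(slo=0.999, window_minutes=30 * 24 * 60)
print(round(budget.allowed_downtime, 1))   # 43.2
print(budget.should_freeze_changes(50.0))  # True
print(budget.should_freeze_changes(10.0))  # False
```

The point is that "some downtime" becomes a number agreed on in advance. Going over it is the trigger for the "no more changes" conversation, rather than someone's gut feeling.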