by JohnFen
2333 days ago
I think I have utterly failed to successfully convey the point I was trying to make. My point was not that I expect zero downtime or perfect reliability. My point is that I expect that companies don't consider downtime to be an acceptable and normal thing.
|
And my point is that if a company isn't doing this, they're idiots. SRE is entirely about planning for downtime. You have incident response procedures to minimize downtime when problems happen. You have tools like error budgets to make explicit your organizational goals. But all of these are predicated on the assumption that incidents (and downtime) are a "normal" thing that will happen.
Again, if SRE's goal is solely to minimize downtime at the cost of other organizational priorities, there's a very simple way to do that: disallow all new features and keep running the same app you have today. That would easily cut outages for most apps by a factor of ten.
> My point is that I expect that companies don't consider downtime to be an acceptable
So you think it's unacceptable to have an SLA? That's a very common way of making explicit the amount of downtime the organization feels is acceptable. This kind of error budget is just a non-public SLA that's used to guide development, as opposed to paying people. I'm curious what companies you use that publish 100% uptime guarantees, or similar SLAs.
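To put rough numbers on what an error budget means, here is a minimal sketch of the arithmetic. The function names, the 30-day window, and the 99.9% target are my own illustrative assumptions, not any particular company's policy or tooling.

```python
def error_budget_minutes(availability_target: float, window_days: int = 30) -> float:
    """Minutes of downtime permitted in the window for a given availability target.

    availability_target is a fraction, e.g. 0.999 for "three nines" (hypothetical value).
    """
    if not 0.0 <= availability_target <= 1.0:
        raise ValueError("availability_target must be between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - availability_target) * total_minutes


def budget_remaining(availability_target: float,
                     downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Budget left after the downtime already incurred; negative means it is overspent."""
    return error_budget_minutes(availability_target, window_days) - downtime_minutes


if __name__ == "__main__":
    # A 99.9% target over 30 days allows roughly 43 minutes of downtime.
    print(error_budget_minutes(0.999))
    # A 100% target leaves no budget at all, so any incident overspends it.
    print(error_budget_minutes(1.0))
```

The point of the second call is that a 100% target gives a budget of zero, which is why published SLAs almost never promise it.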