
by JohnFen 2333 days ago

> if the choice is "our service occasionally goes down" and "we never release new features", you may accept the risk of occasionally going down.

I don't think that I would, because I don't accept the premise of that being the necessary choice. It's just the choice that the providers deign to offer for economic reasons.

But my objection isn't that there should be zero downtime. My objection is the idea that a service provider considers any downtime to be acceptable.


joshuamorton 2333 days ago

> My objection is the idea that a service provider considers any downtime to be acceptable.

If you don't view any downtime as acceptable, the logical thing to do is invest all of your resources into reducing downtime. This means investing solely in reliability infrastructure and redundancy, and making few or no changes to the system, since change introduces failure.

Since no service does that, the logical conclusion is that very few people actually consider any downtime unacceptable. Broadly speaking, I can think of literally no service that advertises "zero downtime". Cold storage gets close, but even they offer a measly 12 or 16 9s of reliability.

In other words, reliability is a business goal, much like any other business goal. Trying to achieve "perfect" reliability with limited resources isn't a good time. So looking at error budgets empowers SREs. You can go to leaders and say "hey, we're exceeding our error budget, so we're not making any more changes and only working on reliability until we're back within our agreed reliability target."
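To make the error-budget arithmetic concrete, here's a minimal sketch. The function names, the 99.9% target, and the 30-day window are all made up for illustration; real policies vary and this isn't taken from any particular company's process.

```python
from datetime import timedelta


def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Downtime allowed in `window` for an availability target such as 0.999."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    return window * (1.0 - slo_target)


def budget_remaining(
    slo_target: float, window: timedelta, downtime_so_far: timedelta
) -> timedelta:
    """Budget left in the window; a negative value means it has been exceeded."""
    return error_budget(slo_target, window) - downtime_so_far


def should_freeze_features(
    slo_target: float, window: timedelta, downtime_so_far: timedelta
) -> bool:
    """Hypothetical policy: stop feature work once the budget is overspent."""
    return budget_remaining(slo_target, window, downtime_so_far) < timedelta(0)


if __name__ == "__main__":
    window = timedelta(days=30)
    # A 99.9% target over 30 days allows roughly 43 minutes of downtime.
    print(error_budget(0.999, window))
    # 50 minutes of downtime overspends that budget, so the policy says freeze.
    print(should_freeze_features(0.999, window, timedelta(minutes=50)))
```

The point is that the freeze decision comes from a number everyone agreed on beforehand, rather than from arguing about each outage after the fact.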

JohnFen 2333 days ago

I think I have utterly failed to successfully convey the point I was trying to make.

My point was not that I expect zero downtime or perfect reliability. My point is that I expect that companies don't consider downtime to be an acceptable and normal thing.

joshuamorton 2333 days ago

> My point is that I expect that companies don't consider downtime to be an acceptable and normal thing.

And my point is that if a company isn't doing this, they're idiots. SRE is entirely about planning for downtime. You have incident response procedures to minimize downtime when problems happen. You have tools like error budgets to make explicit your organizational goals. But all of these are predicated on the assumption that incidents (and downtime) are a "normal" thing that will happen.

Again, if SRE's goal is solely to minimize downtime at the cost of other organizational priorities, there's a very simple way to do that: disallow all new features and maintain the same app today. That would easily cut outages for most apps by a factor of ten.

> My point is that I expect that companies don't consider downtime to be an acceptable

So you think it's unacceptable to have an SLA? That's a very common way of making explicit the amount of downtime the organization feels is acceptable. This kind of error budget is just a non-public SLA that's used to guide development, as opposed to paying people. I'm curious what companies you use that publish 100% uptime guarantees, or similar SLAs.

JohnFen 2332 days ago

> So you think it's unacceptable to have an SLA? That's a very common way of making explicit the amount of downtime the organization feels is acceptable.

Perhaps we mean different things by "acceptable". SLAs are a promise that downtime won't exceed certain levels. They are not a declaration that downtime is "acceptable", only an acknowledgment that it's inevitable and an attempt to characterize that inevitability.

What I mean is that when downtime happens, nobody at the company should be thinking "this is fine". They should be very concerned and engaging in urgent and speedy resolution of the problem.

The idea that a service is expecting and accepting downtime as part of normal operation and, even worse, as part of some sort of tradeoff with regard to developing new features is just bizarre and unacceptable to me.

It indicates a level of unconcern about customer needs and experience that renders the service untrustworthy.

joshuamorton 2332 days ago

But again, this just acknowledges reality. You only have a finite number of employees. If you aren't devoting all of them to reliability and stability, you're making a trade off with feature velocity.

Being aware of that trade off is more organizationally mature than not being aware of it.

> What I mean is that when downtime happens, nobody at the company should be thinking "this is fine". They should be very concerned and engaging in urgent and speedy resolution of the problem.

If you think this, you've entirely misunderstood. Error budgets aren't about how you respond to outages when they happen. Individual outages should be dealt with quickly. Error budgets matter when you're making planning decisions for the next year or quarter.
