by liamdiprose 2333 days ago
From Google's Site Reliability Engineering book[1]:

> SRE’s goal is no longer "zero outages"; rather, SREs and product developers aim to spend the error budget getting maximum feature velocity. This change makes all the difference. An outage is no longer a "bad" thing—it is an expected part of the process of innovation, and an occurrence that both development and SRE teams manage rather than fear.

I suspect Search has a lower error budget than Sheets.

[1] https://landing.google.com/sre/sre-book/chapters/introductio...


This interested me enough to read some of that chapter. Here are a few quotes that give more context:

> Traditional operations teams and their counterparts in product development thus often end up in conflict, most visibly over how quickly software can be released to production. At their core, the development teams want to launch new features and see them adopted by users. At their core, the ops teams want to make sure the service doesn’t break while they are holding the pager. Because most outages are caused by some kind of change—a new configuration, a new feature launch, or a new type of user traffic—the two teams’ goals are fundamentally in tension.

...

> The use of an error budget resolves the structural conflict of incentives between development and SRE. SRE’s goal is no longer "zero outages"; rather, SREs and product developers aim to spend the error budget getting maximum feature velocity. This change makes all the difference.

...

> ...the decision to stop releases for the remainder of the quarter once an error budget is depleted might not be embraced by a product development team unless mandated by their management.

> An outage is no longer a "bad" thing—it is an expected part of the process of innovation, and an occurrence that both development and SRE teams manage rather than fear.

As a user, outages are always bad things. That Google's SRE team thinks otherwise is chilling.

Of course outages are bad. But if the choice is "our service occasionally goes down" and "we never release new features", you may accept the risk of occasionally going down.

So yeah, outages are always bad, but the alternatives can be worse.

> if the choice is "our service occasionally goes down" and "we never release new features", you may accept the risk of occasionally going down.

I don't think that I would, because I don't accept the premise of that being the necessary choice. It's just the choice that the providers deign to offer for economic reasons.

But my objection isn't that there should be zero downtime. My objection is the idea that a service provider considers any downtime to be acceptable.

> My objection is the idea that a service provider considers any downtime to be acceptable.

If you don't view any downtime to be acceptable, the logical thing to do is invest all of your resources into reducing downtime. This means solely investing in reliability infrastructure, redundancy, and making few or no changes to the system, since change introduces failure.

Since no service does that, the logical conclusion is that very few people actually consider any downtime unacceptable. Broadly speaking, I can think of literally no service that advertises "zero downtime". Cold storage gets close, but even they offer a measly 12 or 16 9s of reliability.

In other words, reliability is a business goal, much like any other business goal. Trying to achieve "perfect" reliability with limited resources isn't a good time. So looking at error budgets empowers SREs. You can go to leaders and say "hey, we're exceeding our error budget, so we're not making any more changes and are only working on reliability until we're back within our agreed reliability."
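
To make that policy concrete, here is a rough Python sketch of an error-budget check. It is only an illustration under my own assumptions, not anything from the SRE book: the `ErrorBudget` class, its field names, and the request-based way of counting failures are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical request-based error budget for one measurement window."""

    slo: float              # agreed success ratio, e.g. 0.999 (assumed value)
    total_requests: int     # requests served in the window
    failed_requests: int    # requests that failed in the window

    @property
    def allowed_failures(self) -> float:
        # The budget is whatever the SLO leaves over: (1 - SLO) * traffic.
        return (1 - self.slo) * self.total_requests

    @property
    def remaining(self) -> float:
        # Negative means the budget has been overspent.
        return self.allowed_failures - self.failed_requests

    def releases_frozen(self) -> bool:
        # Once the budget is exhausted, stop shipping changes and
        # work on reliability until the service is back within budget.
        return self.remaining < 0


healthy = ErrorBudget(slo=0.999, total_requests=1_000_000, failed_requests=500)
overspent = ErrorBudget(slo=0.999, total_requests=1_000_000, failed_requests=1_500)

print(healthy.releases_frozen())    # budget left, keep releasing
print(overspent.releases_frozen())  # budget blown, freeze changes
```

The point of the sketch is that the freeze decision falls out of a number both teams agreed on in advance, rather than out of an argument after each incident.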

I think I have utterly failed to successfully convey the point I was trying to make.

My point was not that I expect zero downtime or perfect reliability. My point is that I expect that companies don't consider downtime to be an acceptable and normal thing.

> My point is that I expect that companies don't consider downtime to be an acceptable and normal thing.

And my point is that if a company isn't doing this, they're idiots. SRE is entirely about planning for downtime. You have incident response procedures to minimize downtime when problems happen. You have tools like error budgets to make explicit your organizational goals. But all of these are predicated on the assumption that incidents (and downtime) are a "normal" thing that will happen.

Again, if SRE's goal is solely to minimize downtime at the cost of other organizational priorities, there's a very simple way to do that: disallow all new features and maintain the same app today. That would easily cut outages for most apps by a factor of ten.

> My point is that I expect that companies don't consider downtime to be an acceptable

So you think it's unacceptable to have an SLA? That's a very common way of making explicit the amount of downtime the organization feels is acceptable. This kind of error budget is just a non-public SLA that's used to guide development, as opposed to pay people. I'm curious what companies you use that publish 100% uptime guarantees, or similar SLAs.
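
For a sense of scale, here is a small Python sketch that turns an availability target into the downtime it permits. The function name, the 30-day window, and the sample targets are my own assumptions for illustration, not figures from any particular provider's SLA.

```python
def allowed_downtime_minutes(availability: float, window_days: int = 30) -> float:
    """Minutes of downtime permitted by an availability target over a window.

    `availability` is a ratio such as 0.999; the 30-day default window
    is an assumption made for this example.
    """
    window_minutes = window_days * 24 * 60
    return (1 - availability) * window_minutes


# Sample targets, chosen only to show how quickly the allowance shrinks.
for target in (0.99, 0.999, 0.9999, 1.0):
    minutes = allowed_downtime_minutes(target)
    print(f"{target:.2%} availability -> {minutes:.2f} minutes per 30 days")
```

Any target below 100% is an explicit statement of how much downtime is tolerated; only a target of exactly 1.0 leaves an allowance of zero.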

"Chilling" is a bit of an overstatement, no? Maybe if we were talking about ATC systems or medical device firmware, etc., it would be more fitting.
You could set the failure budget really low if you never want any new features, but many users do want new releases now and then.