
by torinmr 3615 days ago

This was a pretty interesting article that hits very close to home (I'm an SRE at Google). I think the central thesis (that developers are better at running rapidly changing products because they are able to find and fix bugs more quickly) is a bit flawed, however.

The reason is that I think the most valuable contribution of the SRE is not in responding quickly to outages, but in improving the system to avoid outages in the first place. SREs tend to be better at this than developers because (a) they have better knowledge of best practices by virtue of doing this kind of work all day every day and (b) they are more incentivized to prioritize this kind of work.

Because of this, the dynamic I commonly observe is that SRE-run services have fewer and smaller release-related outages, because techniques like canarying, gradual rollouts, and automated release evaluation are used extensively. Developer-run services, on the other hand, tend to have more frequent and larger release-related outages because these techniques are not used or are used ineffectively. So even though the developers can diagnose the cause of a release-related bug more efficiently than SREs can, the SRE-run service is still more reliable.
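
To make the canarying and gradual-rollout idea concrete, here is a minimal sketch, not any real Google tooling. The names `gradual_rollout` and `error_rate_at`, the stage fractions, and the tolerance are all hypothetical and chosen purely for illustration. The release is exposed to a growing fraction of traffic and rolled back as soon as its error rate exceeds the baseline by more than the tolerance.

```python
from typing import Callable, Sequence


def gradual_rollout(
    stages: Sequence[float],
    error_rate_at: Callable[[float], float],
    baseline_error_rate: float,
    tolerance: float = 0.01,
) -> float:
    """Return the traffic fraction left on the new release (0.0 = rolled back).

    stages: increasing traffic fractions, e.g. [0.01, 0.1, 1.0] (illustrative).
    error_rate_at: hypothetical hook that exposes the release to the given
        fraction of traffic and returns the error rate observed there.
    """
    for fraction in stages:
        observed = error_rate_at(fraction)
        # Automated release evaluation: compare the canary to the baseline.
        if observed > baseline_error_rate + tolerance:
            return 0.0  # roll back before the bad release spreads further
    return stages[-1] if stages else 0.0
```

The point of the staging is that a bad release gets caught while it serves only the first small slice of traffic, so the outage is smaller no matter how quickly someone can debug it afterwards.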

In my view, the main reasons to have developers support their own services are: (a) there aren't enough SREs to support everything, (b) the service is small enough that investing the kind of manpower SRE would put into implementing these best practices would not be cost-effective, and (c) SRE support can be used as a carrot to get developers to improve their own services.

Edit: I would add that if the role of oncall is expected to include only carrying the pager, and not making substantial contributions to improve the reliability of the system, then the author is absolutely right that having an SRE or similar carry the pager has next to no benefit.

1 comment

TheCoelacanth 3614 days ago

Exactly. Even the most trivial of bugs can't be fixed as quickly as Google would need for fixing bugs fast to be their go-to strategy. You simply cannot rely on that as your response to a show-stopping bug if you have stringent uptime requirements.
