Hacker News
by poisonta 833 days ago
I'm curious whether Google's success in launching software that seems not fully developed can be attributed to its Site Reliability Engineering (SRE) practices.
2 comments

Not really, the company is massive and until recently very motivated (promo) to launch new things. SREs probably helped get things across the finish line but likely didn't start those projects.
> Not really, the company is massive and until recently very motivated (promo) to launch new things.

What changed?

Layoffs and a change in the performance management system (moving away from perf).
> change in the performance management system (moving away from perf)

This sounds... confusing. They moved away from performance?

This is a dumb comment, but yes, part of the role of SREs was helping people make (and then implement) trade-offs around system deployment while deploying things that basically worked as intended.
As I understand it (from friends who were SREs in the 2010s) the really clever bit was that projects basically had a budget for "how much SRE attention your deployment needed" - so there was payoff for getting more deployment details right the first time, and structural pushback for just throwing things over the wall. Sounded like an interesting way to connect up the levers...
There may be accountability issues within their development teams. Google Cloud's reliability is questionable: 500 errors are frequent, and a request that fails often succeeds if you keep retrying it. This suggests that their teams have an error budget and may not act until their Site Reliability Engineering (SRE) team flags the issue.
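To make the error budget and retry ideas above concrete, here is a minimal Python sketch. It is not Google's implementation; the function names (`allowed_failures`, `budget_remaining`, `retry_on_5xx`), the 99.9% SLO, and the backoff delays are all assumptions chosen for illustration. The idea is that an SLO leaves a fixed allowance of failed requests per window, and a client that retries 5xx responses hides individual failures while the allowance is still being spent.

```python
import time
from typing import Callable, Tuple


def allowed_failures(slo: float, total_requests: int) -> float:
    """Failed requests the SLO tolerates over the window (the error budget)."""
    return (1.0 - slo) * total_requests


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Positive: budget left. Negative: the budget is overspent."""
    return allowed_failures(slo, total_requests) - failed_requests


def retry_on_5xx(
    call: Callable[[], int],
    attempts: int = 5,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, int]:
    """Call a hypothetical request function until it returns a non-5xx status.

    `call` is assumed to return an HTTP status code. Returns the last
    status and the number of attempts made, backing off exponentially.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    status = 0
    for attempt in range(attempts):
        status = call()
        if status < 500:
            return status, attempt + 1
        if attempt < attempts - 1:
            # Wait base_delay, 2*base_delay, 4*base_delay, ... between tries.
            sleep(base_delay * 2 ** attempt)
    return status, attempts
```

Under these assumptions, a 99.9% SLO over one million requests tolerates roughly 1,000 failures, so `budget_remaining(0.999, 1_000_000, 1200)` comes out negative. That is the point at which, in the model described above, a team would be expected to stop and fix reliability rather than rely on client retries.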