| HN Mirror



	by poisonta 880 days ago
	I'm curious whether Google's success in launching software that seems not fully developed can be attributed to their Site Reliability Engineering (SRE) practices.

2 comments

dpbriggs 880 days ago

Not really, the company is massive and until recently very motivated (promo) to launch new things. SREs probably helped get things across the finish line but likely didn't start those projects.

oblio 879 days ago

> Not really, the company is massive and until recently very motivated (promo) to launch new things.

What changed?

dpbriggs 879 days ago

Layoffs and a change in the performance management system (moving away from perf).

oblio 879 days ago

> change in the performance management system (moving away from perf)

This sounds... confusing. They moved away from performance?

bananapub 880 days ago

this is a dumb comment, but yes, part of the role of SREs was helping people make (and then implement) trade-offs around system deployment while deploying things that basically worked as intended.

eichin 880 days ago

As I understand it (from friends who were SREs in the 2010s) the really clever bit was that projects basically had a budget for "how much SRE attention your deployment needed" - so there was payoff for getting more deployment details right the first time, and structural pushback for just throwing things over the wall. Sounded like an interesting way to connect up the levers...

poisonta 873 days ago

It seems there may be accountability issues within their development teams. The reliability of Google Cloud is in question: 500 errors appear to be frequent, and if you keep retrying a request, it may eventually succeed. This suggests that their teams may have an error budget and might not act until their Site Reliability Engineering (SRE) team flags the issue.
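
For anyone unfamiliar with the term: an error budget is the share of requests a service is allowed to fail while still meeting its service-level objective (SLO). A rough sketch of the arithmetic, assuming a hypothetical 99.9% availability target and made-up request counts (the function names and numbers are illustrative, not anything Google has published):

```python
def error_budget(slo_target: float, total_requests: int) -> float:
    """Number of failed requests the SLO tolerates over the window.

    slo_target is a fraction such as 0.999 (hypothetical value).
    """
    if not 0.0 <= slo_target < 1.0:
        raise ValueError("slo_target must be in [0, 1)")
    return (1.0 - slo_target) * total_requests


def budget_remaining(slo_target: float, total_requests: int,
                     failed_requests: int) -> float:
    """Fraction of the error budget still unspent.

    Goes negative once failures exceed what the SLO allows.
    """
    budget = error_budget(slo_target, total_requests)
    if budget == 0:
        raise ValueError("no requests in the window, budget is undefined")
    return 1.0 - failed_requests / budget


if __name__ == "__main__":
    # Assumed numbers: 1,000,000 requests in the window, 250 of them returned 5xx.
    allowed = error_budget(0.999, 1_000_000)
    left = budget_remaining(0.999, 1_000_000, 250)
    print(f"allowed failures: {allowed:.0f}, budget left: {left:.0%}")
```

With those assumed numbers, about 1,000 failures are tolerated, so 250 failed requests would leave roughly three quarters of the budget unspent. Under that model a team could see a steady trickle of 500s and still be "within budget", which would be consistent with the behaviour described above.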
