by bpchaps 2889 days ago
When reading these articles, never forget that your company is NOT Google! If your company doesn't have the management/infrastructure/communication/skill structure that Google has, then it will be very difficult to implement these fundamentals.

In many cases, an SRE is a job to save costs. If your company doesn't get its shit together and doesn't give your SREs the support they need, then they'll hate their jobs and the company.


I have to disagree. The typical and intuitive ways of reasoning about outages and outage risk - screaming at the engineers until they fix it, desperately passing the buck, finding someone to fire in the aftermath - are not a good fit for any context. Every company can benefit from a more principled mental model of system reliability.
If your company's management doesn't even know what an SRE is, then you're stuck in the same exact place, where the SREs are the ones being screamed at instead. Some companies just rename "devops" to "SRE".
I think the renaming is fine as long as it also comes with the responsibility of driving the tracking and improving of site reliability :)
I just renamed my microwave to "refrigerator", but all of my food caught on fire and started leaking operational debt! :(
> In many cases, an SRE is a job to save costs.

This is 100% the case. I would actually argue that it is the only job of the SRE organization: hit the budgets by balancing the costs of availability vs. the costs of unavailability. If the org has massive budgets and general budget flexibility, then it is easy. Otherwise, SREs are magicians pulling rabbits out of a hat, inventing the most awe-inspiring methods/tools/hacks/workarounds needed to meet and beat budget targets.
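
To make the trade-off concrete, here's a rough sketch of the arithmetic. Every number and name in it is made up for illustration (the price of each availability tier, the hourly cost of an outage, the function names), and real budgeting is far messier than one formula. The idea is just that total cost = what you pay to reach a target + what the remaining downtime is expected to cost you.

```python
HOURS_PER_YEAR = 24 * 365

# Hypothetical yearly spend needed to reach each availability target.
# These figures are invented; plug in your own.
OPTIONS = {
    0.99: 50_000,
    0.999: 200_000,
    0.9999: 900_000,
}


def expected_downtime_hours(availability: float) -> float:
    """Hours per year the service is expected to be down at this target."""
    return (1.0 - availability) * HOURS_PER_YEAR


def total_annual_cost(availability: float,
                      cost_of_availability: float,
                      downtime_cost_per_hour: float) -> float:
    """Cost of availability plus expected cost of unavailability."""
    downtime_cost = expected_downtime_hours(availability) * downtime_cost_per_hour
    return cost_of_availability + downtime_cost


def cheapest_target(options: dict, downtime_cost_per_hour: float) -> float:
    """Pick the availability target with the lowest total annual cost."""
    return min(
        options,
        key=lambda a: total_annual_cost(a, options[a], downtime_cost_per_hour),
    )


if __name__ == "__main__":
    # Assumed outage costs per hour, again purely illustrative.
    for hourly in (1_000, 10_000, 1_000_000):
        print(hourly, cheapest_target(OPTIONS, hourly))
```

With these invented inputs, the "right" number of nines moves with what an hour of downtime costs you, which is the whole point: a shop losing little per hour of outage has no business paying for Google-grade availability.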

In other orgs, SREs are an indirect way of outsourcing everything to SaaS providers.

Yep. It's pretty disheartening, frankly.
I have no idea why you’re being downvoted. It’s the same thing as Borg/Kubernetes, MapReduce/Hadoop: some things just don’t apply or aren’t as effective unless you’re operating at a huge scale and with Google’s culture.
> unless you’re operating at a huge scale and with Google’s culture.

I'm not sure one has to go to the extreme of huge scale, anywhere near where Google is now (not that that's what you said), nor adopt all the aspects of their culture, but I agree that key fundamental aspects are often missed.

My favorite example is to point out that Google does not run Hadoop on expensive, virtualized AWS instances (or even brand-name servers with useless-for-purpose features[1] that creep up the cost). Rather, one of their competitive advantages, from the very start, has been to optimize hardware that they purchase, customize, and operate for cost (and performance).

The other is, as you mention, culture, which involves a remarkable amount of specialization, with groups dedicated to hardware, networking, internal tooling (i.e. building and maintaining the Hadoop-equivalent), and, of course, SRE, who couldn't even begin to do their jobs without all those other groups' support.

Of course, there's an argument to be made that things like k8s and PaaS/IaaS can take the place of all those supporting groups at Google, but my counterargument is that they both fail to impart any benefit of customization (or, conversely, the cultural benefit of the mindset of doing everything that way across the entire company) and carry a tremendous cost (in money and complexity).

[1] redundant power supplies, high-density chassis, onboard hardware RAID

No idea, either. This whole thread is being bombarded with downvotes.

Downvoters: What's up?