by fensterblick 967 days ago
Recently, I've heard of several companies folding up SRE and moving individuals to their SWE teams. Rumors are that LinkedIn, Adobe, and Robinhood have done this.

This made me think: is SRE a byproduct of a bubble economy of easy money? Why not operate without the significant added expense of SRE teams?

I wonder if SRE will be around 10 years from now, much like Sys Admins and QA testers have mostly disappeared. Instead, many of those functions are performed by software development teams.

15 comments

>Instead, many of those functions are performed by software development teams.

And they likely won’t be as good as dedicated SRE teams.

But few businesses care about that right now considering layoffs.

Throwing developers at a problem even if it isn’t in their skill set is an industry trend that won’t go away, and it will be more pronounced during downturns.

Full stack developers are a great example of rolling two roles together without twice the pay.

Eh it’s not the same thing. (I’m very full stack with intermittent devops/sre experience).

Full stack means you write code running on back end and front end. 99% of the time the code you write on the FE interfaces with your other code for the BE. It’s pretty coherent and feedback loops are similar.

Devops/SRE, on the other hand, is very different, and I agree we shouldn’t expect software developers to be mixing SRE into their day to day. The skills, tools, mindset, feedback loop, and stress levels are too different.

If you’re not doing simple monoliths then you need a dedicated devops/SRE team.

If you can be good at front and back end and keep up with both of them simultaneously, that's great, but:

- you spend more time to keep up with both of those sectors compared to dedicated front or back end positions

- you context switch more often than dedicated positions

- you spent more time getting good at both of those things

- you remove some of the communication overhead there would be with two positions

You are definitely not being compensated for that extra work and benefit to the business given that full stack salaries are close to front end and back end position salaries.

Is it extra work, though? It's not like backend engineers sit around not doing anything because they don't have FE work to do.
I vehemently refuse to do frontend...

Have I done it a lot before the SPA era? yes.

Would I be able to do a half decent job today? Probably.

Would that eat into what brainpower I currently muster to fulfill my backend role? I'm convinced of it.

Can I continue to earn a living in the current market? I'm afraid not for long...

SRE is not a byproduct of a bubble economy. I believe Google has had SREs since the very beginning. Still, I think the rest of the point stands. These days with devops the skill set needed for devs has indeed expanded to have significant overlap with SREs. I expect companies to downsize their SRE teams and distribute responsibilities to devs.

A second major reason is automation. If you read the linked site long enough you'll find that in the early days of Google, SREs did plenty of manual work like deployments while manually watching graphs. They were indispensable then simply because even Google didn't have enough automation in their systems! You can read the story of Sisyphus https://www.usenix.org/sites/default/files/conference/protec... to get a sense of how Google's initial failure to adopt standardized automation ensured job security for SREs.

Pedantically, Google didn't have SREs at the beginning. I asked a very early SRE, Lucas, (https://www.nytimes.com/2002/11/28/technology/postcards-from... and https://hackernoon.com/this-is-going-to-be-huge-google-found...), and he said that in the early days, outages would be really distracting to "the devs like Jeff and Sanjay" and he and a few others ended up forming SRE to handle site reliability more formally during the early days of growth, when Google got a reputation for being fast and scalable and nearly always up.

Lucas helped make one of my favorite Google Historical Artefacts, a crayon chart of search volume. They had to continuously rescale the graph in powers of ten due to exponential growth.

I miss pre-IPO Google and the Internet of that time.

> “These days with devops the skill set needed for devs has indeed expanded to have significant overlap with SREs”

Respectfully disagree on this. SRE is a huge complex realm unto itself. Just understanding how all the cloud components and environments and role systems work together is multiple training courses, let alone how to reliably deploy and run in them.

But modern approaches to dev require the SWEs to understand and model the operation of their software, and in fact program in terms of it — “writing infrastructure” rather than just code.

Lambda functions, for example: you have to understand their performance and scalability characteristics — in turn requiring knowledge of things like the latency added by crossing the boundary between a managed shared service cluster and a VPC — in order to understand how and where to factor things into individual deployable functions.

That is barely tip-toeing across the very edges of SRE land.
Alright, how about expecting devs to repackage their entire until-that-point-SaaS stack into an "appliance" (Kubernetes Helm chart), containing SWE-written resource manifests that define the application's scaling characteristics across arbitrarily-shaped k8s clusters they won't get to see in advance, using only node taints; memory limits for layers of their stack they've never even seen run full-bore before; health checks that multiplex back up to a central monitoring platform; safely-revertible multiphase upgrade rollout behavior that never decreases availability; and so forth;

...and then those same devs being expected to directly debug the behavior of this "appliance" in a client environment (think: someone consuming the "appliance" through the Amazon Marketplace, where this launches the workload into an EKS cluster in the customer's own VPC, with the customer in control of defining that cluster's node pools);

...where this can involve, for example, figuring out that a seemingly-innocent bounded-size Redis cache deployment needs 10x its steady-state memory when booting from a persisted AOF file... for some godforsaken reason.

Yea, this is buying and using toys. Need to go down a few layers of abstraction
The idea of ops people who wrote code for deployment and monitoring and had responsibility for incident management and change control existed before Google gave it a name.

Source: I was one at WebTV in 1996, and I worked with people who did it at Xerox PARC and General Magic long before then.

Two parts to this.

Is SRE a bubble thing?

I never got why SRE existed. (SRE has been my title...) The job responsibilities (caring about monitoring, logging, performance, and metrics of applications) are all things a qualified developer should be doing. Offloading the care of operating software to someone other than the person who wrote it just seems illogical to me. Put the SWEs on call. If SWEs think the best way to do something is manual, have them do it themselves, then fire them for being terrible engineers. All these tedious interviews and a SWE doesn't know how the computer they are programming works? It's insane. All that schooling, and things like how the OS works, which is part of an undergrad curriculum, get offloaded to a career and title mostly made up of self-taught sysadmin people? Every good SWE I've known knew how the OS, computer, and network work.

> if SRE will be around 10 years from now,

Other tasks that SRE typically does now, generalized automation, providing dev tools and improving dev experience, are being moved to "platform" and teams with those names. I expect it to change significantly.

Oddly, the call to put the SWEs in the on-call rotation was one of the original goals of site reliability engineering as an institutional discipline. The idea at conception was that SREs were expensive, and only after product teams got their act together could they truly justify the cost of full-time reliability engineering support.

It's only in the past 10 years (reasonable people may disagree on that figure) that being a site reliability engineer came to mean being something other than a professional cranky jackass.

What I care about as an SRE is not graphs or performance or even whether my pager stays silent (though, that would be nice). No, I want the product teams to have good enough tools (and, crucially, the knowledge behind them) to keep hitting their goals.

Sometimes, frankly, the monitoring and performance get in the way of that.

> Other tasks that SRE typically does now, generalized automation, providing dev tools and improving dev experience, are being moved to "platform" and teams with those names. I expect it to change significantly.

Yeah, this is my experience, too. "DevOps" (loosely, the trend you describe in the first paragraph) is eating SRE from one end and "Platform" from the other. SRE are basically evolving into "System Engineers" responsible for operating and sometimes developing common infrastructure and its associated tools.

I don't think that's a bad thing at all! Platform engineering is more fun, you're distributing the load of responsibility in a way that's really sensible, and engineers who are directly responsible for tracking regressions, performance, and whatnot, IME, develop better products.

> Is SRE a byproduct of a bubble economy of easy money? Why not operate without the significant added expense of SRE teams?

I'm a SWE SRE. I think in some cases it is better to be folded into a team. In other cases, less so.

One SRE team can support many different dev teams, and often the dev teams are not at all focusing time on the very complicated infra/distributed systems aspect of their job, it's just not something that they worry about day to day.

So it makes sense to have an 'infra' team that operates at a different granularity than specialized dev teams.

That may or may not need to be called SRE, or maybe it's an SRE SWE team, or maybe you just call it 'infrastructure' but at a certain scale you have more cross cutting concerns across teams where it's cheaper to split things out that way.

Even Google is doing this now.

I think it’s simply swapping one set of trade offs for another. With dedicated SREs you have true specialists in production operations and their accompanying systems (tooling, alerting, etc) with a clear mandate and ownership of outcomes; but they don’t necessarily have full ownership of what they’re keeping running, and that can cause organizational problems (we want to launch X, SRE says no, or vice versa) and make it so non-SREs take no ownership over their hard-to-support code.

Conversely, you can have Eng teams without SREs, and without most of those organizational/social problems, at the cost of production reliability being only one of many priorities.

I think what’s really happening is that a lot of companies are deciding they don’t care about reliability very much as a business outcome, especially when it comes at the expense (at least in opportunity cost) of fewer features.

I work for a large cloud service that is not Google where the SRE culture varies heavily depending on which product you’re building. SREs are a necessity to free up devs to do actual dev work. Platform and infra teams should tightly couple SWEs and SREs to keep SWEs accountable, but not responsible for day to day operations of the infra - you’ll never get anything done :)
The fact is many/most SWEs don't have the skillset or interest to do SRE work. While there is a lot of overlap, the work can be quite different between the two areas. SRE basically maps to the sysadmin role of old, which has never really gone away and I don't think it's a product of a "bubble economy".
If you think of an SRE as an expensive sysadmin then yes, you should absolutely scratch that entire org. SRE, by Google's definition, is supposed to contain software engineers with deep systems expertise, not some kind of less-qualified SWEs.
I haven’t noticed that in my corner of one of those mentioned companies. Also, I’m not an SRE, but during the height of the recent tech layoffs the only job postings I was seeing were for SRE.
> much like [...] QA testers have mostly disappeared.

Who told you that?

QA isn't going anywhere... someone is doing testing, and that someone is a tester. They can be an s/w engineer by training, but as long as they are testing they are a tester.

With sysadmins, there are fashion waves, where they keep being called different names like DevOps or SRE. I've not heard of such a thing with testing.

> someone is doing testing, and that someone is a -tester- user

excuse me for remembering something surely HN considers a platitude: "everyone has a TEST environment, few are fortunate enough to also have a PROD one"

Well, let me take this seriously for a moment. I believe that companies which don't have dedicated testers today are the same companies which didn't have dedicated testers before.

We really use the language of "users doing the testing" jokingly. No software is written without testing; not even very trivial programs run the first time. So when we say that, we just mean that there wasn't enough testing.

There is a process, however, that is meant to decrease the number of testers employed. The more testing can be automated, the fewer testers would be necessary... but that hinges on the premise that the prior number of testers was somehow sufficient for the amount of testing that was necessary. I believe, though, that the number of testers hired was a function of budget more than anything else. There's never enough testing, and, in principle, it's hard to see how testing can be exhaustive. So, hopefully, with more automation, it's possible to test more, but I believe that the number of testers will remain more or less a function of budget.

> With sysadmins, there are fashion waves, where they keep being called different names like DevOps or SRE.

I don't think the name change really originated with Sysadmins. Basically these new titles were created (with narrow definitions) and then other companies said "We are cool like Google, we have SREs now, no Sysadmins" so all the jobs had new titles.

Source: Me and my last 4 jobs ( Sysadmin -> Devops Engineer -> Infrastructure Developer -> SRE ) which are all basically the same thing

Sys admins changed name to SREs, which changed name to devops engineers or cloud engineers or whatever the title is now.

Still the same competency. Someone needs to know how those protocols work, tell you latency characteristics of storage, and read those core dumps.

In my G SRE interview, I had to do the same rigorous Software Engineering algorithms rounds as well as show deep distributed systems knowledge in designing highly available systems.
If by rigorous algorithms you mean, spend a month memorizing a few dozen leetcode problems then sure, I’ll agree that is sadly the state of SRE interviews at FAANG.
I interviewed at multiple FAANGs and not one of the questions they asked was on leetcode or hackerrank. (I searched afterwards.)
> is SRE a byproduct of a bubble economy of easy money?

I think it’s definitely one of the aspects.

Talking with some SRE friends, the reasons they give for why their role is important are the multitude of moving parts in the current development environment (partially related to easy money funding resume-driven development and a lot of tech stack side quests) and how the hiring bar has been lowered (this one due to hiring managers with almost infinite budgets and a lot of questionable product initiatives).

I imagine the threshold is something like 1 SRE for every $1mm of high-margin revenue you can link to guaranteeing the 2nd "9" of $product availability/reliability.
I believe that is indeed a good guide for when it makes sense to have a SRE team supporting a service or product (with the caveat that the number probably isn't $1MM).

There are also good patterns for ensuring you actually have adequate SRE coverage for what the business needs. Two 6-person teams, geographically dispersed, doing 7x12 shifts works pretty well (not cheap). You can do it with fewer but you run into more challenges when individuals leave / get burnt out / etc.

That’s sort of ridiculous. A mid-level SRE easily costs a quarter of that. And a company like Apple would then have 80,000 SREs? Lol no.
I think you've perhaps misread my post?

It's marginal revenue attributable to a high-performing SRE (i.e. an SRE who would be able to elevate a product they're supporting from 90.0% availability to 99.0% availability).

It's actually a pretty high bar, because there aren't that many products for which that segment of availability translates to >$1mm in marginal revenue. $1mm is a ballpark figure, but I think it's the right order of magnitude (i.e. the true number might be $5mm).

Expanding on another point in the original post: the decision varies with the profitability of that marginal revenue. For example, it's basically pure profit for Google, Amazon or Netflix. Accordingly, it makes sense that they'd have many people who focus exclusively on performance and availability, to make sure they aren't leaving that revenue on the ground.
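
To put rough numbers on the "2nd 9" argument above, here is a minimal sketch. It assumes that revenue is lost in direct proportion to downtime, which is a simplification made for illustration rather than anything claimed in the thread. The function names are hypothetical, and the $1mm figure is the ballpark from the comments above.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime per year at a given availability (0.0 to 1.0)."""
    return (1.0 - availability) * HOURS_PER_YEAR


def breakeven_revenue(marginal_target: float, before: float, after: float) -> float:
    """Annual revenue at which the availability gain is worth `marginal_target`.

    Assumes revenue is lost in direct proportion to downtime, which is a
    simplification made for this sketch, not a claim from the thread.
    """
    return marginal_target / (after - before)


if __name__ == "__main__":
    for availability in (0.90, 0.99):
        hours = downtime_hours_per_year(availability)
        print(f"{availability:.1%} available: about {hours:.1f} hours down per year")

    # $1mm is the ballpark marginal revenue figure used in the comments above.
    needed = breakeven_revenue(1_000_000, before=0.90, after=0.99)
    print(f"Revenue needed for the second nine to be worth $1mm: about ${needed:,.0f}")
```

Under that proportionality assumption, going from 90% to 99% cuts downtime from roughly 876 to roughly 88 hours a year, and a product would need on the order of $11mm in annual revenue before that gain is worth $1mm. That is consistent with the point that it is a fairly high bar.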

SysAdmins didn't disappear, they just learned some cloud stuff and changed titles. We call them "DevOps Engineers" now.
I guess now they have a team of software engineers, where part is focused on infra and part on backend. Sys Admins disappeared? They are DevOps/IT Engineers now. QA? SWE in Test, and so on.