Hacker News
by throwaway892238 1480 days ago
I think this myth exists because Google was (is?) famously obsessed with SWE. But if you actually read the SRE books and look at the actual discipline of SRE ("what's the difference between SWE and SRE?"), SRE is quite blatantly just operations management. The website is a power plant, and the SRE runs the power plant. You don't build parts to run a power plant, you use software (as in manipulate/control/operate) to run it. You act quickly when the numbers go out of line, you write reports and control how much power is going in and out, respond to surges and dips, etc.

For whatever reason, Google decided to tell people that the same person who's building the klaxon and the concrete wall and the pipes for the power plant, and the person who's operating the power plant, are one and the same. But that's clearly bunk. Building a part and running a system are completely different disciplines, and anyone who does both will only be half good at both. Humans are shit at multitasking and there are few true polymaths out there. Show me a master programmer and I'll show you an amateur woodworker.

I also don't believe software engineering principles will help you reduce operational complexity. If anything, software engineering tends to make things either inefficient or subtly complicated. Reducing operational complexity comes from the discipline of operations, which isn't engineering. Non-tech companies have known about these distinctions for like a hundred years. Deming applied scientific rigor and analysis to come up with better practices, but he didn't have to design any widgets to do it.

3 comments

> For whatever reason, Google decided to tell people that the same person who's building the klaxon and the concrete wall and the pipes for the power plant, and the person who's operating the power plant, are one in the same. But that's clearly bunk. Building a part and running a system are completely different disciplines, and anyone who does both will only be half good at both

Depending on the team, SREs can absolutely be involved with "building the system", especially the klaxon ;) Examples include designing and implementing metrics used to make decisions in business logic and/or exposed to customers/users, writing routing components like mixers and proxies, developing data pipelines, etc. At Google many SRE teams build and run entire multi-tenant systems with no pure SWEs involved at all.

Healthy SRE teams should be spending 20% of their time on operations. On my team it's actually the devs who do most of the operations work. They take the pager during business hours and we route most maintenance tickets to them.

“[…] and we route most maintenance tickets to them.”

My difficulty is that mandated separation of responsibilities within our org is preventing us from embedding ops in dev.

Anyone successfully fought against this and have tips?

One company I worked for opened a position for an ops person on the team.

They shadow-IT’d their way to launch and were hugely successful; now the business is largely re-orging to better fit the paradigm.

Was a big gamble. The wrong person could have left a mountain of tech-debt.

The website is NOT a power plant, it's just code. In software, "operations management" is basically infrastructure automation, incident response and build and release. All of these require some software development or at least code literacy and familiarity with software development practices. If there's large overlap in technical skill between the operators and the builders, then it makes more sense to see them as the same but focussing on different problems.

It's probably useful to talk about what Operations Management is first. It's a business discipline that touches on many parts of a business. It is defined as "the management of an organization's productive resources or its production system, which converts inputs into the organization's products and services". You can get a PhD in Operations Management.

In tech, software and data are the "productive resources", and the "production system" is the actual system you build out of those resources: the website, API, etc. You don't have to write any software to build and manage that production system. Maybe that's unusual to people in tech today, but it's a fact that you don't have to write a single line of code to build and operate such a system. Heroku, PagerDuty, DataDog, Splunk, Octopus, AWS, etc, all are products built with the sole purpose of enabling operations without the need to write code. You can assemble logging, alerting, monitoring, web server, networking, database, deployment, etc, without ever writing a single line of code, and have it be highly available and highly reliable.

The title will vary (Systems Engineer, Operations Engineer, DevOps Engineer, Site Reliability Engineer, Systems Administrator) but the job is the same: to use Operations Management techniques to ensure the products and services are productive. You can use software development practices for all of this, sure! But they are absolutely not a requirement to accomplish the goal. And many other roles in the company are involved in Operations (QA, PM, DM, etc) and may or may not use code. The business doesn't care about code, it cares that its resources are being used properly and the production line is operating nominally.

In terms of the distinction between builders and operators: you could say that a construction worker and a custodial worker are part of the same occupation because a lot of their skills overlap. They both need to understand how the building works and may need to build/repair parts of it at times. But they're still two different disciplines that require different training, experience, and day to day responsibilities, and as such we don't lump them into the same category.

There's at least one big issue here, which is that you're pretending that a website is like a building or a dam. If that were the case, a company like Google would have a (relatively) small team of SWEs who "built" things, and a much larger team of SREs who maintained them over their operational lifecycle once the SWEs were done building the thing. But that isn't the case.

Software systems (at least in competitive consumer markets) are constantly changing and evolving. To use the dam analogy, there's constantly a team of people making the dam taller or wider or deeper, even while the dam is running and producing power.

All the SRE teams I've worked with have done a bunch of things that go beyond "operations". They are usually consulted at the design stage, to make sure that the thing is going to be built reliably. They're also responsible for ensuring ongoing reliability as all new features are added. That means that the features themselves don't impact reliability, and that the process of adding new features doesn't impact reliability. None of this work has a reasonable analogue in your dam analogy, except perhaps as some combination of consultant and regulatory body.

"You don't have to write any software to build and manage that production system."

It depends on the scale and complexity of your application. At some scale/complexity, it absolutely requires writing software, because your IaaS provider doesn't provide you with automation that covers 100% of your operational needs, and even they recommend using infrastructure-as-code tools to manage your infra.
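To make the infrastructure-as-code point concrete, here is a minimal Python sketch of the general idea, not any real tool's API: the desired state is declared as data, and a small function works out the changes needed to get there. The function `plan`, the `desired`/`actual` dicts, and the service names are all hypothetical.

```python
def plan(desired: dict[str, int], actual: dict[str, int]) -> list[str]:
    """Return the actions needed to move `actual` replica counts to `desired`."""
    actions = []
    # Look at every service that is either declared or currently running.
    for service in sorted(desired.keys() | actual.keys()):
        want = desired.get(service, 0)
        have = actual.get(service, 0)
        if want > have:
            actions.append(f"scale up {service}: {have} -> {want}")
        elif want < have:
            actions.append(f"scale down {service}: {have} -> {want}")
    return actions


# Hypothetical declared state, kept in version control alongside the code.
desired = {"web": 3, "worker": 2}
# Hypothetical state observed in the running environment.
actual = {"web": 1, "worker": 2, "legacy-cron": 1}

for action in plan(desired, actual):
    print(action)
```

Even this toy version is software someone has to write, review, and maintain, which is the point being made about operating at scale.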

If your production system is a CRUD service with 3 application nodes and a managed PostgreSQL instance then you do not need to write software to manage it. But if your application is that simple, then I'd suggest you probably don't need a software developer to build it (WordPress, Wix).

Construction vs custodian is not a fair analogy because their training and evaluation don't really overlap. The training and eval for "dev" and "systems" engineers are very similar; most have CS degrees and have to do some leet coding to get the job. Devs generally need to be better at algos; systems engineers need a better understanding of networking, OS, system design.

> I also don't believe software engineering principles will help you reduce operational complexity.

This isn't a goal of SRE, in my opinion or in anything I can recall reading. The goal of applying software engineering principles is to accept the increased complexity in exchange for a reduced operational burden.

There are layers of that effect, and the right one depends largely on your operational burden. Sysadmins shun complexity, so systems are simple but doing mass updates requires a lot of manpower. DevOps embraces some complexity, like Ansible or manually orchestrated containers, which makes mass updates easier, though they are still a burden. SRE embraces complexity, in exchange for a dramatic reduction in manual effort on many tasks.

The idea is that at certain scales (or reliability requirements), it becomes cheaper to hire a small number of expensive people that can manage complex systems than it is to hire a large number of people each managing a simple system.
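A back-of-the-envelope version of that trade-off, as a Python sketch with entirely made-up numbers (the salaries and systems-per-person figures are assumptions for illustration, not data from anyone in this thread):

```python
def annual_staff_cost(systems: int, systems_per_person: int, salary: int) -> int:
    """Cost of staffing `systems` when one person can handle `systems_per_person`."""
    people = -(-systems // systems_per_person)  # ceiling division
    return people * salary


# Hypothetical: an operator handles 20 simple systems by hand.
def manual_cost(systems: int) -> int:
    return annual_staff_cost(systems, systems_per_person=20, salary=100_000)


# Hypothetical: a pricier engineer handles 500 systems through automation.
def automated_cost(systems: int) -> int:
    return annual_staff_cost(systems, systems_per_person=500, salary=250_000)


for n in (10, 50, 1000):
    print(n, manual_cost(n), automated_cost(n))
```

With these assumed figures the manual approach is cheaper at 10 systems and the automated one is already cheaper at 50; the crossover moves with whatever numbers you plug in.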

Software engineering arises because it can effectively trade complexity for reduced operational burdens in exactly the areas you want. You don't have to migrate to a new infrastructure orchestration tool, you can just write an orchestration tool on top of what's there (which I've actually seen done). Was it perfect? No. Was it cheaper than migrating a half million containers to Kubernetes? Yes.

Operations management tends to be very inflexible. They have a set of tools, and anything outside those tools is either a no go or will require replacing an old tool at the cost of months of effort.