by georgewfraser 526 days ago
Like so many things from Google engineering, this will be toxic to your startup. SREs read stuff like this, get main character syndrome, and start redoing the technical designs of all the other teams, and not in a good way.

This phenomenon can occur in all “overlay” functions. For example, the legal department will try to run the entire company if you don’t have a good leader who keeps the team in their lane.


In my experience, SREs are usually "enforcers of maintainability". If your engineers don't want to be oncall, they need to produce applications and services that are documented and maintainable. It's an amazing forcing function. SRE doesn't often redo technical designs; there's plenty of reliability and scalability work to do...
Your engineers should be on call.
At a 200-person company, sure. But when you're in the tens or hundreds of thousands, that's a hard no. Especially when dealing with out-of-scope dependencies.
I work for a company with millions of employees. Our SDEs and their managers carry and are responsible for answering the pagers. We don’t have SREs.
What? Engineers should own the code they write, including being on call to maintain it. Out-of-scope dependencies should be irrelevant, and if they're not, get some of those tens or hundreds of thousands of employees to work on better observability.

I agree that if you own the blahblah service then you shouldn't get alerts for a broken dependency foobaz if that team is already aware, but if blahblah itself breaks, not being around to fix it is pretty dangerous.
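As a rough sketch of that rule (the `Alert` type, `should_page`, and the set of acknowledged dependencies are all made up for illustration, not any real paging system's API, and paging on an unacknowledged dependency failure is my assumption):

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    service: str  # service whose check fired
    cause: str    # service believed to be at fault


def should_page(alert: Alert, owned: set[str], acknowledged: set[str]) -> bool:
    """Decide whether the owning team's oncall gets paged (hypothetical rule)."""
    if alert.cause in owned:
        # Our own service broke: always page.
        return True
    # A dependency is at fault: stay quiet once its team is already aware.
    return alert.cause not in acknowledged


owned = {"blahblah"}
acknowledged = {"foobaz"}  # the foobaz team has already acknowledged its incident

own_failure = should_page(Alert("blahblah", "blahblah"), owned, acknowledged)
known_dependency = should_page(Alert("blahblah", "foobaz"), owned, acknowledged)
```

Here `own_failure` is true and `known_dependency` is false, which is the split I mean: noise from foobaz is suppressed, but a broken blahblah still reaches its owners.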

> But when you're in the tens or hundreds of thousands, that's a hard no.

What? No, not at all. I worked in such a company, and oncall was indeed a thing, and it was tremendously easy to deal with upstream and downstream dependencies. You have dashboards monitoring metrics emitted in calls to and from third-party services, and runbooks that made it quite clear who to call when any of the dependencies misbehaved. If anything happened, everyone was briefed and even on a call.

This boils down to ownership and accountability. It makes no difference whether the company has 10 or 100k employees.

Since the 90s, the whole DNS on which the internet stands today was run successfully, with minimal errors, by a bunch of folks who used to call themselves sysadmins. Developers seem to have run out of things to develop, and they have been reinventing themselves as devops and SREs. They have been pushing out pure sysadmins, but at the same time this trend shows how far the demand for developers or SWEs falls short of the supply of developers in the market.
Take one look at the Kubernetes source code and it becomes clear that you can make successful software with zero clue about good software engineering.
What is objectively bad code in the code base of k8s? Is it really worse than any other system?
Yes, it is worse, much worse. A large part of the reason for that is that it's written in Go. The other part is that it's written by Googlers and sysadmin people, two groups not particularly known for their great software engineering skills. My personal experience here is mostly with cAdvisor (which I guess is not strictly part of Kubernetes but comes from the same ecosystem). It is chock full of horrible error handling (if there is any), uninitialized structures and a dozen layers of indirection.