by somsak2 970 days ago
If you work at a company on a team with a user-facing surface, you need to have an on-call rotation. Consumer expectations of uptime are extremely high and there's no way to build software that ensures 100% reliability without any human intervention.

Fundamentally disagree with this. In fact, most of human technology has worked without on-call rotations. From the complicated movements of mechanical watches, to the still-ticking heartbeats of the Voyager probes.

Most software engineers just happen to be bad engineers. On-call rotations are a band-aid for poor planning and development (to be fair, often imposed by tech-delinquent middle managers or executives).

Fundamentally disagree with this but I can see how someone who thinks at the level of "Most software engineers just happen to be bad engineers" would see it that way.

I have worked at firms of various sizes and there is always a point where things become complex. Software is as much about managing the humans as it is about the software, if not more so. This becomes especially true as the firm grows in size. Like all fields, there are certainly some individuals that perform better or worse than others, but even for the best engineers out there, mistakes happen and edge cases pop up, especially as the potential complexity grows. Of course these mistakes can pop up more frequently depending on the imposed deadlines. Deadlines to me are a healthy balancing act between the different parts of the business. Sometimes they are arbitrary, but I think in a healthy relationship it helps to have that pushback/friction to figure out how much effort is required.

That was a long way of saying I think it's a pretty naive and dismissive view to just hand-wave and say this is due to both bad engineers and tech-delinquent middle managers. You are not asking for it either, but I think this also comes down to social ability/skills. If your worldview is that most software engineers are bad engineers, I can only imagine this shows up in the workplace.

So how did IBM manage to produce mainframes, with software/firmware included, that achieve six 9s uptime?
Obviously the truth is somewhere in between: there is no engineer so talented that they can produce a 100% reliable and available system, and the percentage goes down the more complex the system gets. The decision of whether to have an on-call rotation should be based on the consequences of downtime, not on some kind of moral stance on human fallibility.
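For a sense of scale on "six 9s", here is a rough back-of-the-envelope sketch of how much downtime per year a given number of nines allows. The function name and the 365.25-day year are my own assumptions, not anything from IBM.

```python
# Assumption: a "year" is 365.25 days; other conventions shift the result slightly.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def allowed_downtime_seconds(nines: int) -> float:
    """Downtime budget per year for an availability of N nines (e.g. 3 -> 99.9%)."""
    unavailability = 10 ** -nines  # fraction of the year the system may be down
    return SECONDS_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in (3, 4, 5, 6):
        print(f"{n} nines: {allowed_downtime_seconds(n):.1f} seconds of downtime per year")
```

Six nines works out to roughly half a minute per year, which is why the consequences of downtime, rather than principle, tend to drive how much anyone pays for it.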
Couldn’t disagree more.

Nobody engineers software like spacecraft companies; that doesn’t mean that no one else gives a shit, it just means that cost constraints are real things.

Also, the demands on spacecraft software are trivial (“move the camera once a month”, “do a correction burn after a planetary encounter”, “watch this sensor and do this if it changes”) compared to a modern web application at a Fortune 500 company.

I don't really understand the analogy. Mechanical watches have created an industry of skilled practitioners trained to fix them because they often need diagnosis and repair. NASA has a room full of oncall personnel available for many hours any time they launch anything, and they launch things much less frequently than your average tech company.

Oncall rotations are part of defense-in-depth against bugs and unforeseen circumstances: Most of the companies that survive without a formal one only do so by outsourcing this for the most common cases; to Cloudflare, to Amazon, etc. -- if there's an opportunity cost to being down someone needs to be able to pick up the phone when there's an outage or critical issue.

Let’s get rid of on-call firemen while we’re at it; they shouldn’t really be required.

Let’s all plan our emergencies to 8am to 5pm, Monday to Friday. And don’t forget the scheduled lunch break at 1pm!

Being on call is a fireman's job, and they do have shifts. In most companies you have something akin to construction workers who work from 9 to 5 and then are also expected to be available to do casual firefighting from 5 to 9 because it's "easier".
At a certain scale mature software companies do in fact have dedicated incident managers, and dedicated SREs who work primarily on stability from a more systematic perspective. However they still need support from the application developers due to the nature of software.

In the old days operations tended to be very isolated in much the way you are proposing. The problem with this is that stability depends very much on the software, so over time operations folks would be extremely defensive and impose all kinds of constraints on what software could do, and the software engineers would be frustrated that they couldn't do things efficiently. Imagine how firefighters would feel if construction workers had a tendency to randomly leave explosives and gas cans hidden throughout new construction and then waltzed off to the next job while the firefighters had to deal with the consequences.

At the end of the day, devs need to have some skin in the game or it's a recipe for disaster.

The "skin in the game" argument is, IMO, not compelling. It is clearly possible to have stable software services delivered by separate Dev and Ops teams that communicate using a well-defined software interface -- look at any app that uses Heroku or a similar PaaS.

But, as we know, useful software interfaces are difficult to define well and, once they exist, they tend to be the most inflexible part of a fast-changing system. It is always better (though of course more expensive) to control both sides of an interface for this reason.

The "skin in the game" argument elides this fundamental reason and substitutes one that implies all of this is the fault of lazy devs, which isn't (generally) true IME.

This argument doesn't hold water. Both Heroku and teams that run apps on Heroku have their own on-call teams. Yes, you can build stable interfaces and separation of responsibilities between infra and business services, but someone still has to be responsible for the business services stability.
The juxtaposition of "this is how it's done because otherwise those that fight fires impose restrictions that those that build don't like" and "construction workers vs. firefighters" seems to be undermining your point...

In mature industries, there absolutely are plenty of regulations in place to make sure that builders don't make responders' life harder. That doesn't mean that the responders aren't needed, but the fact that the software industry as a whole decided to go all "response is the only thing we need for most things" is evidence that it is not mature.

Sure, any point can be undermined by overgeneralization and bad analogies.

The nature of software and physical construction is different.

They can be scheduled on shifts instead of being on call.

It just requires spending more money hiring more people.

John Deere sells tractors because of their oncall process

When you're in the middle of harvesting and your tractor breaks down, you want it fixed now, not at somebody else's convenience.

Mechanics and tow trucks also form an oncall for broken down cars.

There's oncalls all over the place for tech of all kinds. I think the biggest difference with software is intellectual property - we've made it so nobody else is allowed to fix whatever's broken, so of course, we need oncalls to fix the problems instead of letting customers go to their preferred mechanic

You can't compare a mechanical watch to a modern web app. If you told the people building the modern web app that they were allowed to make one release per decade and not allowed to "maintain it", you would see the feature list drop off and the release date shoot out.

If you have different constraints, you get a different result.

In a vacuum, technology does not need people on-call. The problem is when you're iterating on technology (e.g. SaaS).

Shipping things has the highest risk of breaking something. I worked with an SRE team responsible for managing incidents for a few months. They told me that ~80% of incidents are caused by bad code being shipped, and I saw that happen as well.

Modern software applications are complex and interconnected. It's pretty easy to unintentionally break something in a different part of the application, or ship subtly bad code because you aren't intimately familiar with that part of the codebase.

Not a great comparison. I presume you are trolling.

Mechanical watches have been around for roughly 500 years (about 4x-5x the time since the first program, depending on how you count), which is a substantial amount of time to iterate on core functionality. Even then, watches until the 1970s (when quartz was introduced) were often imprecise enough to lose 15 minutes/day.

The Voyager probes have both lost several instruments and were built with substantial amounts of redundancy, all at an inflation-adjusted cost of about $3.94B. Maintenance per year is estimated to be about $5M, including the occasional software update.

This seems incredibly naive considering companies where engineers are pushing literally thousands of commits a day. Things will break, period.
I only had one experience with this early in my career, but I remember it wasn't good. Basically, we had a 5-6 person team that had to support like 10 important legacy services. One person essentially ended up being on call for a week, having to wake up early to debug issues with a thing they didn't build in the first place and only vaguely understood.

Do you think this company had good CI/CD and automated tests? They did not. There was fortunately a lot of monitoring so at least you knew when the service was in a bad state, but absolutely nothing else other than a ticket and an angry customer.

I would much rather have extensive test coverage, very good CI/CD, make sure not to do releases on weekends and holidays, and have a few people whose job is to do the monitoring and escalate to the right people rather than just putting a target on a random engineer's back and hope they can fix things quickly.
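The "no releases on weekends and holidays" part is the easiest bit to automate. A minimal sketch of a release gate, assuming a hand-maintained holiday list (the `HOLIDAYS` set and `release_allowed` name are made up for illustration):

```python
from datetime import date

# Hypothetical holiday calendar; a real team would maintain its own list.
HOLIDAYS = {date(2024, 12, 25), date(2025, 1, 1)}


def release_allowed(day: date, holidays: set = HOLIDAYS) -> bool:
    """Return False on Saturdays, Sundays, and listed holidays."""
    is_weekend = day.weekday() >= 5  # Monday is 0, so 5 and 6 are Sat/Sun
    return not is_weekend and day not in holidays


if __name__ == "__main__":
    today = date.today()
    print(f"Release allowed on {today}: {release_allowed(today)}")
```

A CI pipeline could call something like this before the deploy step and refuse to continue when it returns False.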

On call rotation is one thing.

Everyone being motivated to develop in a way that doesn’t result in brittle, breaking software, and maybe even using boring tech for stability, is another, so that even if on call is a real thing, it’s relatively benign. It’s not a bad thing to rely on the genius of smart people who have been at it for decades and are decades ahead in some realizations.

Having software that self-reports, logs multiple occurrences of the same errors, and escalates errors that start in the app is one great way to stay ahead of issues.

By the time someone reaches out it’s easy to find the error, session, and user, and say ok, we see it and are on it. Acknowledgement at this depth upfront quite often lets customers say it’s ok, take a look on Monday. It’s reassuring. It's also easy to forward such an issue to a distributed team.
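To make that concrete, here is a minimal sketch of the idea: count repeats of the same error, remember which session and user hit it, and flag it for escalation once it recurs enough. The `ErrorTracker` class, the threshold of 3, and the fingerprint format are all assumptions for illustration, not any particular tool's API.

```python
from collections import Counter, defaultdict


class ErrorTracker:
    """Counts repeated errors and says when one should be escalated."""

    def __init__(self, escalate_after: int = 3):
        self.escalate_after = escalate_after  # assumed threshold
        self.counts = Counter()
        self.contexts = defaultdict(list)  # fingerprint -> [(session, user)]

    @staticmethod
    def fingerprint(exc: Exception) -> str:
        # Treat errors with the same type and message as "the same error".
        return f"{type(exc).__name__}: {exc}"

    def record(self, exc: Exception, session_id: str, user_id: str) -> str:
        """Log one occurrence; return "escalate" when the threshold is first hit."""
        key = self.fingerprint(exc)
        self.counts[key] += 1
        self.contexts[key].append((session_id, user_id))
        if self.counts[key] == self.escalate_after:
            return "escalate"
        return "log"

    def affected(self, exc: Exception) -> list:
        """Sessions and users that hit this error, for a quick customer reply."""
        return list(self.contexts.get(self.fingerprint(exc), []))


if __name__ == "__main__":
    tracker = ErrorTracker()
    for i in range(3):
        action = tracker.record(ValueError("checkout failed"), f"s{i}", f"u{i}")
    print(action, tracker.affected(ValueError("checkout failed")))
```

In a real app the "escalate" result would page someone or open a ticket; here it is just a return value so the logic stays visible.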

You think that, but when GitHub is down, or AWS-East is down, everyone just kind of endures it, waiting until it's back up. And life goes on.
Right, and the reason "everyone" has confidence in just waiting a bit for it to go back up without panicking and having to try to find an alternative is precisely because GitHub and AWS have on-call engineers that are alerted and immediately work on fixing it.
Of course, when a hospital’s IT systems go down, people do indeed have to endure it, but life does not always in fact go on [0]. Horses for courses, I guess.

[0] https://www.theverge.com/2021/9/27/22696097/hospital-ransomw...

Hospital IT system failures are mostly going to cause problems with scheduling, record-keeping, and billing.

By far, the most likely thing to kill you in a hospital is not the IT system but errors by the doctors and nurses. Or that you're too sick to save no matter what they do.