by tomlue 1319 days ago
I don't work at a giant company, but I'm curious:

> Anyone that has worked on large, complex system knows that the margin of error in uptime and downtime is often whether the right person is within arms’ reach of their laptop.

Is this true?

Shouldn't giant tech companies obsess about reducing the need for human intervention?

8 comments

I'm former AWS. Yes, it's true. You'd be surprised how much human intervention is needed for large-scale SaaS/cloud stuff. A lot of it's just scale and probability. If an IT problem has a 0.0001% chance of happening on any given day for an org, a single organization will likely never see it happen during its entire existence. But if you're managing IT for 10 million organizations, it'll statistically happen 10 times per day!
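To make the arithmetic above concrete, here is a rough sketch in Python. It assumes failures are independent across organizations and days, and the 30-year lifetime is just an illustrative number I picked, not something from the comment.

```python
def expected_daily_incidents(p_per_org_per_day: float, num_orgs: int) -> float:
    """Expected number of incidents per day across all organizations."""
    return p_per_org_per_day * num_orgs


def prob_at_least_once(p_per_day: float, days: int) -> float:
    """Chance a single organization sees the problem at least once in `days` days."""
    return 1.0 - (1.0 - p_per_day) ** days


# 0.0001% expressed as a probability (1 in a million per day).
p = 1e-6

# Fleet view: 10 million organizations, about 10 incidents per day.
fleet_rate = expected_daily_incidents(p, 10_000_000)

# Single-org view over a hypothetical 30-year lifetime: roughly a 1% chance.
single_org_chance = prob_at_least_once(p, 30 * 365)

print(f"fleet: ~{fleet_rate:.1f} incidents/day")
print(f"single org over 30 years: ~{single_org_chance:.1%}")
```

Same per-day probability in both cases; only the number of trials changes.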

Giant tech companies do obsess about reducing the need for human intervention. Teams in my org at AWS kept track of failures/intervention rates per thousand instances. If it gets too high, it means you're spending too much engineering effort resolving on-call issues and need to fix it.
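A minimal sketch of that kind of tracking, in Python. The function names and the threshold of 5 interventions per thousand instances are hypothetical; the comment doesn't say what numbers or tooling the AWS teams actually used.

```python
def interventions_per_thousand(interventions: int, instances: int) -> float:
    """Manual interventions normalized per 1,000 instances."""
    if instances <= 0:
        raise ValueError("instances must be positive")
    return 1000 * interventions / instances


def needs_engineering_fix(
    interventions: int,
    instances: int,
    threshold_per_thousand: float = 5.0,  # hypothetical threshold, for illustration only
) -> bool:
    """True when the on-call burden is high enough to justify fixing the root cause."""
    return interventions_per_thousand(interventions, instances) > threshold_per_thousand


# 12 interventions across 4,000 instances is 3 per thousand: under the threshold.
print(needs_engineering_fix(12, 4000))
# 40 interventions across 4,000 instances is 10 per thousand: over the threshold.
print(needs_engineering_fix(40, 4000))
```

Normalizing per thousand instances lets a small fleet and a huge one be compared on the same scale.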

Sure they should, but there are a lot of moving parts, written in different decades. Bugs are found every day, even in projects that have 100% code coverage.

PagerDuty is a multibillion-dollar company for a reason, and they're not even the only company doing what they do.

I don't think you need to have worked at a giant company to understand how bad on-call can be; every engineer knows that. I personally assume on-call is much worse/harder/more nerve-racking at bigger companies.

> Shouldn't giant tech companies obsess about reducing the need for human intervention?

They do. You automate recovery for all the failure modes your system has encountered. Then the system promptly fails in a new way you've never seen before.
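Roughly, the pattern looks like the sketch below: a table of known failure modes with automated fixes, and a fallback to a human for anything not in the table. All the failure-mode names and handlers here are made up for illustration.

```python
from typing import Callable, Dict


def restart_service() -> str:
    return "restarted service"


def fail_over_to_replica() -> str:
    return "failed over to replica"


# Hypothetical registry: one automated remediation per failure mode seen before.
KNOWN_REMEDIATIONS: Dict[str, Callable[[], str]] = {
    "process_crashed": restart_service,
    "primary_unreachable": fail_over_to_replica,
}


def handle_failure(mode: str) -> str:
    """Run the automated fix if the failure mode is known; otherwise page a human."""
    handler = KNOWN_REMEDIATIONS.get(mode)
    if handler is None:
        # A failure nobody has seen before: automation has nothing to offer.
        return f"paged on-call: unrecognized failure mode {mode!r}"
    return handler()


print(handle_failure("process_crashed"))
print(handle_failure("regulator_oscillation"))
```

The registry only ever grows after an incident, which is why the human fallback never goes away.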

Often because some totally different part of the system fails when scaling to new levels.

Yes, but complex systems are always changing and in some ways in a constant process of degrading. A lot of the biggest companies are growing exponentially faster than their processes, and it ends up being nearly impossible for the tooling and supporting software to keep up. At that scale, all the automation software you buy off the shelf won't scale with you. With over a billion dollars a year in surplus infrastructure costs, I would have to imagine Twitter is at that scale.

At Amazon, near Black Friday and Prime Day, there are company-wide deployment freezes, where no one is allowed to push to prod.

When I was on call for my team during those freezes, I found there were fewer pages, fewer issues, and the system was generally more stable.

Entropy, leading to availability problems, grows with the rate of production changes.

If no one touches the code, my guess is the system is more stable rather than less.

Of course this is true, and isn't really a surprise. Very few outages are caused by existing code in a system that was otherwise working perfectly. It's almost always due to some change – whether a bug in newly deployed code, a bad config update or whatever else. An untouched system is absolutely more stable than one in flux.

Of course the solution can't really be "let's not deploy anything, just to be safe", because then your competitors are going to launch new features and leave your product behind.

I think Uber published a study about this. Deployments and changes cause most issues.

They do obsess about reducing human intervention, but in every system I've ever seen, you still need humans for the "out of context" problems https://tvtropes.org/pmwiki/pmwiki.php/Main/OutsideContextPr...

For example, one day SRE got alerted that a bunch of expensive accelerators were unexpectedly shutting down and not restarting in production. In a case like this, SRE has to reach out to the SWEs who built/designed the system to ask some clarifying questions. Together, the SREs and SWEs form a series of hypotheses about the cause, ultimately discovering an entirely unanticipated failure mode.

I think I'm one of the few people in the world who has attached a $100K oscilloscope to the voltage regulator on a machine learning accelerator to debug why a specific training job that did a series of convolutions at a highly specific rate would cause a DC-DC regulator to act like an AC source. It took far, far longer to write and deploy the rule that detected this problem in prod than it took us to identify the problem and stop the killer job.

It should be viewed as a cat and mouse game.

The general philosophy at these orgs is that the same failures should never happen _again_. So you build automation and safeguards protecting the system from the failure modes you know.

However, any complex system will have failure modes you don't know. There might be new software, new features, new APIs etc. going out that interact in complex ways. So complex systems will fail in very interesting ways. So the general philosophy in operating these systems is:

1. First get the system back into a state in which the problem is mitigated.
2. Apply some short-term hacks and roll back any suspicious recent changes.
3. Have someone go a bit deeper and try to root-cause the failure, have a discussion about it with impacted teams (often called a postmortem) and come up with long-term fixes that reduce or eliminate the chance of the same failure happening again.

I have worked in companies where it was true, and others where it was not. In one company we even moved from not true to true, in about 8 years (there was of course not a hard cut).

The companies where it was not true were a pleasure to work in. The management, from the C-suite downwards, knew what they were doing. In the others, it was total chaos.

Of course, if everything is fixed with duct tape, you need firemen ready to act. If everything is solid and robust, there can be small outages, but nothing too critical.

Both are true: if a team can figure out how to reduce the need for human intervention, they generally will. In the limit, what's left is the really-hard-to-anticipate/automate stuff.