by dekhn 1317 days ago
They do obsess about reducing human intervention, but in every system I've ever seen, you still need humans for the "out of context" problems https://tvtropes.org/pmwiki/pmwiki.php/Main/OutsideContextPr...

For example, one day SRE got alerted that a bunch of expensive accelerators were unexpectedly shutting down and not restarting in production. In a case like this, SRE has to reach out to the SWEs who built/designed the system to ask some clarifying questions. Together, the SREs and SWEs form a series of hypotheses about the cause, ultimately discovering an entirely unanticipated failure mode.

I think I'm one of the few people in the world who has attached a $100K oscilloscope to the voltage regulator on a machine learning accelerator to debug why a specific training job that did a series of convolutions at a highly specific rate would cause a DC-DC regulator to act like an AC source. It took far, far longer to write and deploy the rule that detected this problem in prod than it took us to identify the problem and stop the killer job.