by snowfield 840 days ago
The regression is also because a real SRE is expensive. It's cheaper to hire some new grads to react to alarms by following a set runbook that says what to do when each alarm triggers.

VERY few companies operate at Google's scale. For 99.99% of companies it makes sense to investigate single-machine issues.


Google SREs also end up investigating single machine issues, fyi.
Yes, but At Scale®

It's a totally different experience when the people who technically own the hardware side of operations take no responsibility for its well-being, the people who own the software develop elaborate workarounds for bad machines, and the SREs maintain blacklists of individual nodes.

In my experience it's fun to do that, but it's only worth it when SLOs are on the line (that is, when a significant number of machines are bad).