by strifey 192 days ago
Staring down the barrel of being primary on-call over Christmas for a dozen k8s clusters running thousands of nodes. How I wish it were true that we could trust computer programs to just keep running.

PagerDuty wouldn't exist if this were true.

If your workplace has a long enough history, try comparing incidents on work days versus weekends or holidays. Typically the incident rate is dramatically lower when no one is making changes.
Totally true, but we host other people's code (PaaS, etc.). We don't get to dictate their working hours.

It also doesn't mean nothing breaks when people aren't making changes. Certificate expiration is the classic example of something breaking _because_ someone hasn't made a change. Or a slow memory leak. There's a whole class of issues that gets worse when nothing is redeployed for long enough.
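
To make the certificate case concrete, here is a minimal Python sketch of the kind of check that catches it before the holidays. The function names and the 30-day threshold are my own assumptions, not anything from a particular tool. It takes the expiry string in the format of the `notAfter` field that `ssl.SSLSocket.getpeercert()` returns.

```python
import ssl
import time

SECONDS_PER_DAY = 86400


def days_until_expiry(not_after, now=None):
    """Return the days left before a certificate expires.

    not_after: expiry timestamp in the 'notAfter' format used by
               ssl.SSLSocket.getpeercert(), e.g. 'Jan  5 09:34:43 2018 GMT'.
    now:       current time in seconds since the epoch (defaults to time.time()).
    """
    if now is None:
        now = time.time()
    expires_at = ssl.cert_time_to_seconds(not_after)
    return (expires_at - now) / SECONDS_PER_DAY


def needs_renewal(not_after, threshold_days=30, now=None):
    """Hypothetical policy: flag anything expiring within threshold_days."""
    return days_until_expiry(not_after, now) < threshold_days
```

Run something like this on a schedule against every endpoint you own and the "nobody changed anything" outage becomes a ticket instead of a page.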