by jldugger 1276 days ago
On call, but this is the best time of the year to be on call: none of my coworkers are around to break anything. My gut estimate of my pager load is that 80 percent of pages are caused by bad changes pushed by humans, another 10 percent by changes pushed by humans that exercise latent errors in the system, and only the remaining 10 percent by random shit breaking.

The main downside is that I also don't have as many people to escalate to or ask for help when an outage does occur over the holidays. But that's not as urgent now that, say, generator fires are Amazon's problem, not mine.

2 comments

This time of year, you do get the pages for things that were always broken, but nobody noticed before, because they only show up when the system has been running without changes for more than two weeks.
This actually happened to us last week.

Going without deployments revealed that a legacy background processor loses its connection to the message queue and gets stuck in a state where it never reconnects.

Deployments always cycled the pods before the issue manifested.
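
For what it's worth, the underlying fix for this kind of bug is usually a reconnect loop around the consumer. Here is a minimal sketch of that idea, not our actual processor: `connect`, `handle`, and `ConnectionLost` are hypothetical stand-ins for whatever your queue client provides.

```python
import time


class ConnectionLost(Exception):
    """Hypothetical error raised by the queue client when the link drops."""


def consume_forever(connect, handle, max_backoff=30.0, sleep=time.sleep):
    """Consume messages, reconnecting with exponential backoff.

    connect: hypothetical callable returning an iterable of messages
    handle:  callable applied to each message
    Returns the number of reconnects once the stream ends cleanly.
    """
    backoff = 1.0
    reconnects = 0
    while True:
        try:
            for message in connect():
                handle(message)
                backoff = 1.0  # healthy traffic resets the backoff
            return reconnects  # stream ended without an error
        except ConnectionLost:
            # Instead of getting stuck, wait and open a fresh connection.
            reconnects += 1
            sleep(backoff)
            backoff = min(backoff * 2, max_backoff)
```

The `sleep` parameter is only there so the loop can be exercised in a test without real delays.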

This is something a (now former) colleague of mine pointed out: the Kubernetes descheduler can enforce a maximum pod lifetime[0], which effectively forces continual restarts. So if your system cannot tolerate running continuously for a long time, this is one way to gracefully restart long-running pods.

[0]: https://github.com/kubernetes-sigs/descheduler#podlifetime
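
To make the idea concrete, here is a rough sketch of the selection logic a maximum-lifetime policy implies. This is not the descheduler's code or configuration; the function name, the pod names, and the 14-day limit are all made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def pods_over_lifetime(pods, max_lifetime_seconds, now=None):
    """Return names of pods that have run longer than the allowed lifetime.

    pods: dict mapping a (hypothetical) pod name to its timezone-aware start time
    """
    now = now or datetime.now(timezone.utc)
    limit = timedelta(seconds=max_lifetime_seconds)
    return [name for name, started in pods.items() if now - started > limit]


# Hypothetical example: evict anything older than 14 days.
now = datetime(2024, 1, 15, tzinfo=timezone.utc)
pods = {
    "worker-a": now - timedelta(days=20),
    "worker-b": now - timedelta(days=3),
}
to_evict = pods_over_lifetime(pods, 14 * 24 * 3600, now=now)
```

Anything returned would then be evicted gracefully so its controller can start a replacement.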

“Generator fires are highly correlated with Christmas weather” is the best sales pitch I have heard for cloud services.