| HN Mirror



	by jldugger 1276 days ago
	On call, but this is the best time of the year to be on call: none of my coworkers are around to break anything. My gut estimate is that 80 percent of my pages are caused by bad changes pushed by humans, another 10 percent by changes pushed by humans that exercise latent errors in the system, and only the remaining 10 percent by random shit breaking. The main downside is that I also don't have as many people to escalate to or ask for help when an outage does occur over the holidays. But that's less urgent now that, say, generator fires are Amazon's problem, not mine.

2 comments

toast0 1276 days ago

This time of year, you do get the pages for things that were always broken, but nobody noticed before, because they only show up when the system has been running without changes for more than two weeks.

purrcat259 1275 days ago

This actually happened to us last week.

A stretch without deployments revealed that a legacy background processor loses its connection to the message queue and gets stuck in a state where it never reconnects.

Deployments always cycled the pods before the issue manifested.
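
For illustration, here is a minimal sketch of the kind of reconnect loop such a worker would need. It is not the actual processor's code: `consume`, `connect`, and `handle` are hypothetical names, and the sketch assumes the queue client signals a dropped connection by raising `ConnectionError`.

```python
import time


def consume(connect, handle, sleep=time.sleep, max_backoff=30.0):
    """Process messages, reconnecting with exponential backoff.

    `connect` is a hypothetical callable that returns an iterable of
    messages; it (or the iterable) may raise ConnectionError.
    """
    backoff = 1.0
    while True:
        try:
            for message in connect():
                handle(message)
                backoff = 1.0  # a message got through, so reset the delay
            return  # the stream ended cleanly
        except ConnectionError:
            # Wait, then loop around and open a fresh connection
            # instead of staying stuck on the dead one.
            sleep(backoff)
            backoff = min(backoff * 2, max_backoff)
```

The `sleep` parameter is only there so the delay can be swapped out in tests; a real worker would probably also want logging and a health check that fails after repeated reconnect attempts.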

jldugger 1275 days ago

This is something a (now former) colleague of mine pointed out: that the Kubernetes descheduler can enforce a maximum lifetime[0] that sort of forces continual reboots. So if your system cannot tolerate running for a long time continuously, this is one method to gracefully restart long-running pods.

[0]: https://github.com/kubernetes-sigs/descheduler#podlifetime
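
To make the idea concrete, here is a rough sketch of the selection rule behind a maximum pod lifetime: pick every pod whose age exceeds a limit. This is not the descheduler's code and does not talk to a cluster; `pods_past_lifetime` and its dict input are hypothetical, purely for illustration.

```python
from datetime import datetime, timedelta, timezone


def pods_past_lifetime(start_times, max_lifetime, now=None):
    """Return the names of pods that have run longer than max_lifetime.

    start_times: dict mapping pod name -> timezone-aware start time
    (a hypothetical stand-in for what the cluster API would report).
    """
    now = now or datetime.now(timezone.utc)
    return sorted(
        name
        for name, started in start_times.items()
        if now - started > max_lifetime
    )
```

With a seven-day limit, a pod started fifteen days ago would be selected for eviction and one started three hours ago would be left alone.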

mulmen 1276 days ago

“Generator fires are highly correlated with Christmas weather” is the best sales pitch I have heard for cloud services.
