| HN Mirror



	by floatingatoll 1964 days ago
	If your oldest request was queued 5+ seconds ago in a near-realtime system (such as Slack), CPU usage isn't your biggest problem. Slack wrote an autoscaling implementation that ignored request queue depth and downsized their cluster based on CPU usage alone, so while they knew how to resolve it, I would not go so far as to say they knew how to prevent it. The mistake of ignoring the maxage of the request queue is perhaps the second most common blind spot in every Ops team I've ever worked with. No insult to my fellow Ops folks, but we've got to stop overlooking this.
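A minimal sketch of the point above, assuming a hypothetical autoscaler that can read both average CPU utilization and the age of the oldest queued request. The names `ClusterStats` and `scaling_decision` and all thresholds other than the 5-second figure are illustrative assumptions, not Slack's implementation:

```python
from dataclasses import dataclass


@dataclass
class ClusterStats:
    """Hypothetical metrics snapshot; field names are assumptions."""
    cpu_utilization: float       # 0.0 to 1.0, averaged across the cluster
    oldest_request_age_s: float  # seconds the oldest queued request has waited


def scaling_decision(stats, max_age_s=5.0, cpu_low=0.30, cpu_high=0.75):
    """Return "scale_up", "scale_down", or "hold".

    Queue age is checked first, so an idle-looking CPU can never
    trigger a downsize while requests are going stale in the queue.
    """
    if stats.oldest_request_age_s >= max_age_s:
        return "scale_up"
    if stats.cpu_utilization >= cpu_high:
        return "scale_up"
    # Only shrink when CPU is low AND the queue is comfortably fresh.
    if stats.cpu_utilization <= cpu_low and stats.oldest_request_age_s < max_age_s / 2:
        return "scale_down"
    return "hold"
```

With these assumed thresholds, `scaling_decision(ClusterStats(0.10, 6.0))` returns `"scale_up"` even though CPU is nearly idle, which is the case a CPU-only policy would get wrong.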


nicoburns 1964 days ago

> The mistake of ignoring the maxage of the request queue is perhaps the second most common blind spot in every Ops team I've ever worked with. No insult to my fellow Ops folks, but we've got to stop overlooking this.

What's the first?


floatingatoll 1964 days ago

Non-randomized wallclock integers.

For example: “sleep 60 seconds”, “cron 0 * * * * command”, “X-Retry-After: 300”

Found in: recurring jobs, backoff algorithms, oauth tokens.

Found in: ops-created tasks, dev-released software.
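One common remedy, sketched here as an illustration rather than something the comment itself prescribes, is to add random jitter so that many clients or jobs do not all wake on the same wallclock second. The function names and default values below are hypothetical:

```python
import random


def jittered_interval(base_s, spread=0.2, rng=random):
    """Return base_s randomized by +/- spread (a fraction of base_s).

    For example, base_s=60 and spread=0.2 yields a value between
    48 and 72 seconds instead of exactly 60.
    """
    return base_s * rng.uniform(1.0 - spread, 1.0 + spread)


def full_jitter_backoff(attempt, base_s=1.0, cap_s=300.0, rng=random):
    """Exponential backoff with full jitter.

    The delay is drawn uniformly from 0 up to
    min(cap_s, base_s * 2**attempt), so retries from many clients
    spread out instead of arriving in synchronized waves.
    """
    return rng.uniform(0.0, min(cap_s, base_s * 2 ** attempt))
```

A loop that previously called `time.sleep(60)` could call `time.sleep(jittered_interval(60))` instead. For cron, a similar effect comes from picking a per-host random minute rather than minute 0.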


encoderer 1964 days ago

I'm building something at Cronitor to help detect those hot-spots! If you want to learn more, email me: shane at cronitor.io


floatingatoll 1964 days ago

Tell us more here!
