by floatingatoll 1964 days ago
If your oldest request was queued 5+ seconds ago in a near-realtime system (such as Slack), CPU usage isn't your biggest problem.

Slack wrote an autoscaling implementation that ignored request queue depth and downsized their cluster based on CPU usage alone, so while they knew how to resolve it, I would not go so far as to say they knew how to prevent it. The mistake of ignoring the maxage of the request queue is perhaps the second most common blind spot in every Ops team I've ever worked with. No insult to my fellow Ops folks, but we've got to stop overlooking this.
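
To make that concrete, here is a rough sketch of a scaling decision that looks at the age of the oldest queued request before it looks at CPU. This is not Slack's autoscaler; `ClusterStats`, `scaling_decision`, and every threshold below are hypothetical names and numbers picked for illustration.

```python
from dataclasses import dataclass


@dataclass
class ClusterStats:
    cpu_utilization: float        # 0.0 to 1.0, averaged across the cluster
    oldest_request_age_s: float   # seconds the oldest queued request has waited
    queue_depth: int              # number of requests currently waiting


def scaling_decision(stats, cpu_low=0.40, cpu_high=0.75, max_age_s=1.0):
    """Return "scale_up", "scale_down", or "hold" (thresholds are made up)."""
    # Check queue age first: if requests are waiting too long, the cluster
    # is under capacity no matter how idle the CPUs look.
    if stats.oldest_request_age_s > max_age_s:
        return "scale_up"
    if stats.cpu_utilization > cpu_high:
        return "scale_up"
    # Only shrink when CPU is low AND nothing is waiting in the queue.
    if stats.cpu_utilization < cpu_low and stats.queue_depth == 0:
        return "scale_down"
    return "hold"
```

With these assumed thresholds, a cluster at 10% CPU whose oldest request has waited 5 seconds gets scaled up rather than down, which is the case a CPU-only rule gets wrong.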

1 comment

> The mistake of ignoring the maxage of the request queue is perhaps the second most common blind spot in every Ops team I've ever worked with. No insult to my fellow Ops folks, but we've got to stop overlooking this.

What's the first?

Non-randomized wallclock integers.

For example: “sleep 60 seconds”, “cron 0 * * * * command”, “X-Retry-After: 300”

Found in: recurring jobs, backoff algorithms, oauth tokens.

Found in: ops-created tasks, dev-released software.
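
A minimal sketch of the usual fix, which is to randomize the interval so that many clients or jobs do not all fire on the same second. The helper names and default values are hypothetical; the second function uses the backoff variant commonly called "full jitter".

```python
import random


def jittered_interval(base_s, spread=0.1, rng=random):
    """Return base_s randomized by +/- spread (a fraction of base_s).

    Example: jittered_interval(60) gives a value between 54 and 66,
    instead of a fixed "sleep 60 seconds".
    """
    return base_s * (1 + rng.uniform(-spread, spread))


def backoff_with_full_jitter(attempt, base_s=1.0, cap_s=300.0, rng=random):
    """Return a retry delay picked uniformly from 0 up to the capped
    exponential ceiling, so retrying clients spread out over time."""
    ceiling = min(cap_s, base_s * 2 ** attempt)
    return rng.uniform(0, ceiling)
```

A caller would use `time.sleep(jittered_interval(60))` in place of a fixed sleep. For cron, the equivalent is a random startup delay or a per-host minute offset rather than everyone running at minute 0.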

I'm building something at Cronitor to help detect those hot-spots! If you want to learn more, email me: shane at cronitor.io
Tell us more here!