by everforward 236 days ago
Good alerting is hard, even for those of us who are SMEs on it.

My biggest advice is to leverage alerting levels, and to only send high priority alerts for things visible to users.

For alert levels, I usually have 3. P1 (the highest level) is the only one that will fire a phone call/alarm 24/7/365, and only alerts if some kind of very user-visible issue happens (increase in error rate, unacceptable latency, etc). P2 is a mid-tier and only expected to get a response during business hours. That's where I send things that are maybe an issue or can wait, like storage filling up (but not critically so). P3 alerts get sent to a Slack channel, and exist mostly so if you get a P1 alert you can get a quick view of "things that are odd" like CPU spiking.
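A rough sketch of that three-tier routing, in case it helps to see it written down. The channel names and the business-hours window (Mon-Fri, 09:00-17:00) are placeholders I made up; swap in whatever your paging tool and team actually use.

```python
from datetime import datetime
from typing import Tuple

# Hypothetical channel names; map them to whatever your paging tool provides.
ROUTES = {
    "P1": "phone_call",     # wakes someone up, 24/7/365
    "P2": "ticket_queue",   # response expected during business hours
    "P3": "slack_channel",  # context only, nobody is notified directly
}


def is_business_hours(now: datetime) -> bool:
    # Assumed definition: Monday to Friday, 09:00 to 17:00 local time.
    return now.weekday() < 5 and 9 <= now.hour < 17


def route(priority: str, now: datetime) -> Tuple[str, bool]:
    """Return (channel, notify_now) for an alert of the given priority."""
    channel = ROUTES[priority]
    if priority == "P1":
        return channel, True
    if priority == "P2":
        # Outside business hours the alert just waits in the queue.
        return channel, is_business_hours(now)
    return channel, False
```

The point of returning `notify_now` separately is that a P2 still gets recorded at 3am, it just doesn't bother anyone until morning.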

For monitoring, I try to only page on user-visible issues. E.g., I don't routinely monitor CPU usage, because it doesn't correlate to user-visible issues very well. Lots of things can cause CPU to spike, and if it's not impacting users then I don't care. Ditto for network usage, disk IO, etc. Presuming your service does network calls, the 2 things you really care about are success rate and latency. A drop in success rate should trigger a P1 page. An increase in latency should trigger a P2 alert if it's higher than you'd like but okay, and a P1 alert at the "this is impacting users" point. You may want to split those out by endpoint as well, because your acceptable latency probably differs by endpoint.
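Something like this is what I mean by the success rate / latency rules. Every number here is invented for illustration (the 99% floor, the latency thresholds, the `/search` override, the choice of p95), so treat it as a sketch and plug in your own SLOs.

```python
from typing import Optional

# All thresholds are illustrative assumptions, not recommendations.
MIN_SUCCESS_RATE = 0.99
DEFAULT_LATENCY_MS = (500, 2000)  # (higher than you'd like, impacting users)

# Hypothetical per-endpoint overrides, since acceptable latency differs.
ENDPOINT_LATENCY_MS = {
    "/search": (800, 3000),
}


def classify(endpoint: str, success_rate: float,
             p95_latency_ms: float) -> Optional[str]:
    """Return "P1", "P2", or None for one endpoint's current metrics."""
    warn_ms, page_ms = ENDPOINT_LATENCY_MS.get(endpoint, DEFAULT_LATENCY_MS)
    if success_rate < MIN_SUCCESS_RATE:
        return "P1"  # users are seeing errors
    if p95_latency_ms >= page_ms:
        return "P1"  # latency is at the "impacting users" point
    if p95_latency_ms >= warn_ms:
        return "P2"  # slow, but tolerable for now
    return None
```

Note there is no CPU, network, or disk input at all; those would go to the P3 channel as context rather than into this function.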

If your service can't scale, you might also want to adjust those alerts by traffic levels (i.e. if you know you can't handle more than 10k QPS and you can't scale past that, there's no point in paging someone when traffic goes over it, because there's nothing they can do).
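One way to express the traffic adjustment, as a sketch only. The 10k ceiling is the example number from above, and downgrading to P2 (rather than dropping the alert entirely) is my own assumption about what you'd want.

```python
MAX_CAPACITY_QPS = 10_000  # example ceiling; measure your own limit


def effective_priority(priority: str, current_qps: float) -> str:
    """Downgrade pages that fire while traffic exceeds known capacity."""
    # Above the ceiling, on-call can't fix anything by being woken up,
    # so treat the page as a business-hours alert instead.
    if priority == "P1" and current_qps > MAX_CAPACITY_QPS:
        return "P2"
    return priority
```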

You can also add some automation, especially if the apps are stateless. If api-server-5 is behaving weirdly, kill it and spin up a new api-server-5 (or reboot it if physical). A lot of the common first line of defense options are pretty automatable, and can save you from getting paged if an automated restart will fix it. You probably do want some monitoring and rate limiting over that as well, though. E.g. a P2 alert that api-server-5 has been rebooted 4 times today, because repeated reboots are probably an indication of an underlying issue even if reboots temporarily resolve it.
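For the "rate limiting over that" part, a minimal sketch of tracking automated restarts per host. The class name and the sliding 24-hour window are my choices for illustration; the limit of 3 is picked so the 4th restart raises the P2, matching the example above.

```python
from collections import defaultdict, deque
from typing import Optional


class RestartTracker:
    """Counts automated restarts per host within a sliding window."""

    def __init__(self, max_restarts: int = 3, window_seconds: int = 86_400):
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self._history = defaultdict(deque)  # host -> restart timestamps

    def record_restart(self, host: str, now: float) -> Optional[str]:
        """Record a restart at time `now` (seconds); return a P2 message
        if the host has exceeded its restart budget, else None."""
        history = self._history[host]
        history.append(now)
        # Forget restarts that have fallen out of the window.
        while history and now - history[0] >= self.window_seconds:
            history.popleft()
        if len(history) > self.max_restarts:
            return f"P2: {host} restarted {len(history)} times in the window"
        return None
```

Your automation would call `record_restart("api-server-5", time.time())` each time it kills and replaces a host, and forward any non-None result to the P2 queue.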

Thanks...thinking about using AI to learn about what is actually "important" to the developer or team...tracking the alerts that actually lead to manual interventions or important repo changes...this way, we could always automatically send alerts to tiers...just thinking

You could, but I personally wouldn't for a few reasons.

The first is that there are simpler ways that are faster and easier to implement. Just develop a strategy for identifying whether pages are actionable. It depends on your alerting software, but most should support tagging or comments. Make a standard for tagging them as "actioned on" or "not actionable", and write a basic script that iterates over the alerts you've gotten in the past 30 or 90 days and shows the number of times each alert fired and what percentage of times it was tagged as not actionable. Set up a meeting to run that report once a week or month, and either remove or reconfigure alerts that are frequently tagged as not actionable.
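The "basic script" really can be this small. This sketch assumes you've already exported the alerts as a list of dicts with `name`, `fired_at`, and `tag` keys; that shape is made up, so adapt it to whatever your alerting tool's export or API actually returns.

```python
from collections import Counter
from datetime import datetime, timedelta


def actionability_report(alerts, now: datetime, days: int = 30):
    """Return (name, times_fired, pct_not_actionable) rows, worst first.

    `alerts` is an iterable of dicts with hypothetical keys:
    "name" (str), "fired_at" (datetime), "tag" (str).
    """
    cutoff = now - timedelta(days=days)
    fired = Counter()
    not_actionable = Counter()
    for alert in alerts:
        if alert["fired_at"] < cutoff:
            continue  # outside the reporting window
        fired[alert["name"]] += 1
        if alert["tag"] == "not actionable":
            not_actionable[alert["name"]] += 1
    rows = [
        (name, count, 100.0 * not_actionable[name] / count)
        for name, count in fired.items()
    ]
    # Alerts that are most often not actionable float to the top.
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows
```

Print the rows in the review meeting and start from the top; anything that fires often and is mostly tagged "not actionable" is a candidate to remove or reconfigure.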

The second is that I don't think AI is great at that kind of number crunching. I'm sure you could get it to work, but if it's not your primary product then that time is sort of wasted. Paying for the tokens is one thing, but messing with RAG for the 85th time trying to get the AI to do the right thing is basically wasted time.

The last is that I don't like per-alert costs, because they create an environment ripe for cost-cutting by making alerting worse. If people have in the back of their head that it costs $0.05 every time an alert fires, the mental bar for "worth creating a low-priority alert" goes up. You don't want that friction when setting up alerts. You may not care about the cost now, but I'd put down money that it becomes a thing at some point. Alerting tends to scale superlinearly with the popularity of the product. You add tiers to the architecture and need more alerts for more integration points, and your SLOs tighten so the alerts have to be more finicky, and suddenly you're spending $2,000 a month just on alert routing.

Thank you...reading from you guys has been great so far