by hshdhdhehd 237 days ago
I tend to see CPU usage used for two things: scaling, and maybe diagnostics (in about 5% of investigations). Don't alert on it. Maybe alert if you scaled too much, though.

I would recommend alerting on reliability. If errors for an endpoint go above whatever threshold you judge appropriate (e.g. 1%, 0.1% or 0.01%) for a sustained period, then alarm.
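To make "sustained period" concrete, here is a rough Python sketch. It assumes you already get per-window request and error counts from somewhere (say, one pair per minute). The class name, the 1% default and the five-window default are all made up for illustration, not taken from any particular monitoring tool.

```python
from collections import deque


class SustainedErrorRateAlarm:
    """Hypothetical helper: fires only when the error rate has been above
    `threshold` for `sustained_windows` consecutive observation windows."""

    def __init__(self, threshold=0.01, sustained_windows=5):
        self.threshold = threshold
        # Remember only the most recent N breach/no-breach results.
        self.recent = deque(maxlen=sustained_windows)

    def observe(self, requests, errors):
        """Record one window of counts; return True if the alarm should fire."""
        # A window with no traffic is treated as healthy (an assumption).
        rate = errors / requests if requests else 0.0
        self.recent.append(rate > self.threshold)
        # Fire only once the history is full and every window breached.
        return len(self.recent) == self.recent.maxlen and all(self.recent)
```

A single bad window does not page anyone; a single good window resets the streak. The same shape probably works for latency if you swap the error rate for something like a percentile and compare it to a latency threshold.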

Maybe do the same for latency.

For hobby projects, though, I just point the free tier of one of those down-detector services at a few URLs. I may add a health check URL.
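A health check URL can be tiny. Below is a minimal sketch using only the Python standard library. The `/healthz` path, the `CHECKS` dict and the `serve` helper are assumptions for illustration; real checks would be whatever your app actually depends on. It returns 503 on failure on the assumption that the uptime service treats non-2xx responses as down.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def health_status(checks):
    """Run each named check; return an (http_status, body) pair."""
    failed = [name for name, check in checks.items() if not check()]
    if failed:
        return 503, "unhealthy: " + ", ".join(failed)
    return 200, "ok"


# Hypothetical checks; replace with real ones (database ping, disk space, ...).
CHECKS = {"app": lambda: True}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only the assumed /healthz path is served; everything else is 404.
        if self.path != "/healthz":
            self.send_error(404)
            return
        status, body = health_status(CHECKS)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8000):
    """Start the health check server; blocks until interrupted."""
    HTTPServer(("127.0.0.1", port), HealthHandler).serve_forever()
```

Call `serve()` to run it, then point the uptime checker at `/healthz`. Keeping the checks in a plain function makes them easy to test without starting a server.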

Every false alarm should lead to a decision about how to fix it, e.g. a different alarm, a different threshold, or even just dropping that alarm.