by hshdhdhehd
237 days ago
I tend to see CPU usage used for two things: scaling, and maybe diagnostics (for 5% of investigations). Don't alert on it. Maybe alert if you have scaled too much, though. I would recommend alerting on reliability instead. If errors for an endpoint go above whatever threshold you judge appropriate (e.g. 1%, 0.1% or 0.01%) for a sustained period, then alarm. Maybe do the same for latency. For hobby projects, though, I just point the free tier of one of those down-detector services at a few URLs. I may make a health check URL. Every false alarm should lead to some decision about how to fix it, e.g. a different alarm, a different threshold, or even just dropping that alarm.
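To make the "above a threshold for a sustained period" idea concrete, here is a rough Python sketch. It assumes you already collect per-interval request and error counts for an endpoint; the class name, the 1% default and the five-interval window are illustrative choices of mine, not anything from a particular monitoring tool.

```python
from collections import deque


class SustainedErrorRateAlarm:
    """Fires only when the error rate exceeds the threshold for N intervals in a row.

    Hypothetical helper: threshold and sustained_intervals are values you
    would tune yourself (e.g. 0.01, 0.001 or 0.0001 for the threshold).
    """

    def __init__(self, threshold=0.01, sustained_intervals=5):
        self.threshold = threshold
        self.sustained_intervals = sustained_intervals
        # Remember only the most recent N breach/no-breach results.
        self.recent = deque(maxlen=sustained_intervals)

    def observe(self, requests, errors):
        """Record one interval's counts; return True if the alarm should fire."""
        # Treat an interval with no traffic as healthy rather than dividing by zero.
        rate = errors / requests if requests else 0.0
        self.recent.append(rate > self.threshold)
        # Fire only once the window is full and every interval in it breached.
        return len(self.recent) == self.sustained_intervals and all(self.recent)


if __name__ == "__main__":
    alarm = SustainedErrorRateAlarm(threshold=0.01, sustained_intervals=3)
    # (requests, errors) per interval: a single spike, then a sustained problem.
    for counts in [(1000, 50), (1000, 2), (1000, 30), (1000, 40), (1000, 25)]:
        print(counts, "alarm" if alarm.observe(*counts) else "ok")
```

Requiring several consecutive bad intervals is what keeps a single spike from paging anyone. If it still produces false alarms, the threshold and the window length are the two knobs to revisit. The same shape works for latency if you swap the error rate for, say, the fraction of slow requests.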