Hacker News
by coredog64 1817 days ago
High CPU alerts are terrible alerts. If I'm paying per instance, I want CPU utilization to be high. If it's low, I'm wasting money. So now what I need is an alert that fires not when CPU is high, but somewhere between "high" and "too high". You know, like when there's an arbitrary spike because Java is doing some GC. Or you have a one-minute spike of traffic that fires an Opsgenie alert at 2am but auto-clears between when the on-call engineer wakes up and when they log in to check.

For the love of $DEITY, if you're going to set up CloudWatch monitoring, create custom metrics that map to your business outcomes and alert when those go off the rails.
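
For illustration, here is a minimal sketch of publishing one business-level datapoint with boto3's `put_metric_data`. The namespace `MyShop/Business`, the metric name `CheckoutsCompleted`, and the `Service` dimension are made-up placeholders, not anything from a real setup; the actual send needs boto3 and AWS credentials, so it is left commented out.

```python
from datetime import datetime, timezone


def build_metric_datum(name, value, unit="Count", dimensions=None):
    """Build one entry for the MetricData list of CloudWatch PutMetricData."""
    return {
        "MetricName": name,
        "Dimensions": [
            {"Name": k, "Value": v} for k, v in (dimensions or {}).items()
        ],
        "Timestamp": datetime.now(timezone.utc),
        "Value": float(value),
        "Unit": unit,
    }


def publish(namespace, data):
    """Send datapoints to CloudWatch (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works without it

    boto3.client("cloudwatch").put_metric_data(Namespace=namespace, MetricData=data)


# Hypothetical business metric: checkouts completed in the last interval.
datum = build_metric_datum(
    "CheckoutsCompleted", 42, dimensions={"Service": "storefront"}
)
# publish("MyShop/Business", [datum])  # uncomment to actually send
```

An alarm on a metric like this fires when the business outcome degrades, whatever the CPU happens to be doing.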

4 comments

You might want two separate alerts for this problem, one labelled WARNING and the other CRITICAL: for example, 60 percent CPU usage as a warning and 85 percent as critical. You can have two separate SNS topics for warning and critical alerts. Warning alerts can go to a Slack channel, and critical alerts can be configured to invoke the pager.
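
A rough sketch of that two-tier setup using boto3's `put_metric_alarm`. The topic ARNs and instance ID are placeholders, and the 5-minute period with 2 evaluation periods is my own assumption (it makes a short spike less likely to fire anything); only the 60 and 85 percent thresholds come from the comment above.

```python
# Placeholder ARNs and instance ID: substitute your own.
WARNING_TOPIC = "arn:aws:sns:us-east-1:123456789012:cpu-warning"    # e.g. feeds Slack
CRITICAL_TOPIC = "arn:aws:sns:us-east-1:123456789012:cpu-critical"  # e.g. feeds the pager


def cpu_alarm(severity, threshold_pct, topic_arn, instance_id):
    """Return keyword arguments for CloudWatch PutMetricAlarm."""
    return {
        "AlarmName": f"{severity}-cpu-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,           # seconds per datapoint (assumed)
        "EvaluationPeriods": 2,  # must breach for 2 periods in a row (assumed)
        "Threshold": float(threshold_pct),
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],
    }


def create_alarms(alarms):
    """Create the alarms in CloudWatch (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works without it

    client = boto3.client("cloudwatch")
    for kwargs in alarms:
        client.put_metric_alarm(**kwargs)


alarms = [
    cpu_alarm("WARNING", 60, WARNING_TOPIC, "i-0123456789abcdef0"),
    cpu_alarm("CRITICAL", 85, CRITICAL_TOPIC, "i-0123456789abcdef0"),
]
# create_alarms(alarms)  # uncomment to actually create them
```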
What do you do with the Slack messages?
Custom CloudWatch metrics are expensive to write to, which makes them useful only for coarse-grained, high-level service metrics. If you can afford it, go ahead, but setting up some other cloud-native monitoring service may be the way to go.
Certainly not perfect, but I've had very good success alerting on load average over 120 to 150 percent of core count.

What's nice is that it catches a variety of disk issues as well.

I'm sure it's not perfect for all cases, but for me it covers most of them.
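
A minimal sketch of that check in Python. Treating 120 percent as a warning and 150 percent as critical is just one reading of the range above, and the choice of the 5-minute average is also an assumption. On Linux the load average counts tasks blocked in uninterruptible I/O as well as runnable ones, which is likely why it picks up disk trouble too.

```python
import os


def load_alert(load_avg, cores, warn=1.2, crit=1.5):
    """Classify a load average relative to the number of cores."""
    ratio = load_avg / cores
    if ratio >= crit:
        return "CRITICAL"
    if ratio >= warn:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    # os.getloadavg() returns the 1-, 5- and 15-minute averages (Unix only).
    _one, five, _fifteen = os.getloadavg()
    print(load_alert(five, os.cpu_count() or 1))
```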

If you are running some software that requires an instance but is not expected to create load, you can put it on a burstable instance and set up such an alert, so you know when it is time to upgrade.