HN Mirror



	by analogpixel 54 days ago
	> Alerts should be actionable. If no action can or should be taken, then the alert is not needed. Also, the best alerts come from looking at actual failures you had and not trying to make up "good alerts" from thin air. After you have an outage, figure out what alerts would have caught it, and implement those.

3 comments

muvlon 54 days ago

This is one category of good alerts, but not everything.

I think alerts are to ops what tests are to dev. You have "unit alerts" for some small thing like the disk usage on a single host, "integration alerts" like literally "does the page load?" and then what you describe are "regression alerts", trying to prevent something that went wrong once from going wrong again. These are great but just like you wouldn't have 100% regression tests, I think it's also smart to try to get ahead of failures and have some common sense alerts defined.
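To make the analogy concrete, here is a minimal sketch of one "unit alert" and one "integration alert" using only the Python standard library. The function names, the 90% threshold, and the 5-second timeout are illustrative assumptions, not anything prescribed in this thread; a real setup would more likely live in a monitoring system than in a script.

```python
import shutil
import urllib.error
import urllib.request


def disk_alert(path: str = "/", threshold: float = 0.9) -> bool:
    """'Unit alert': fire when the filesystem holding `path` is fuller
    than `threshold` (a fraction between 0 and 1, assumed value)."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total > threshold


def page_alert(url: str, timeout: float = 5.0) -> bool:
    """'Integration alert': fire when the page does not load with a
    2xx status within `timeout` seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return not (200 <= resp.status < 300)
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, or HTTP error status.
        return True
```

Calling `disk_alert("/var", 0.85)` or `page_alert("https://example.com/")` returns `True` when someone should look at it; the path and URL here are placeholders.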


perarneng 54 days ago

"looking at actual failures you had"

Looking at failures others had, and at prior experience from yourself and others, also contributes to good alerts. You don't have to wait for a failure to implement most of them. Most of that knowledge is also trained into most LLMs nowadays. Just ask, verify the sources, then implement. If you get too many alerts, question whether you needed them or whether it's noise. It's constant trimming until you find the right alert setup.


esafak 54 days ago

I know something is going to happen if disk space runs out; I don't need to experience it first.


stackskipton 54 days ago

Sure, but for every alert, there is an exception.

ElasticSearch, for example, can be configured using ILM policies to fill up the disk and then start deleting old records. I don't need to be woken up for a disk filling up on those nodes.

Even worse are CPU/RAM alerts.


esafak 54 days ago

Alerts are for when things don't go as expected. You set up log rotation but an agent quietly breaks it or ES introduces a bug in it.


ajanuary 54 days ago

The number of times I've had to explain how the JVM heap works...
