|
|
|
|
|
by analogpixel
54 days ago
|
|
> Alerts should be actionable. If no action can or should be taken, then the alert is not needed. Also, the best alerts come from looking at actual failures you had and not trying to make up "good alerts" from thin air. After you have an outage, figure out what alerts would have caught it, and implement those. |
|
I think alerts are to ops what tests are to dev. You have "unit alerts" for some small thing like the disk usage on a single host, "integration alerts" like literally "does the page load?" and then what you describe are "regression alerts", trying to prevent something that went wrong once from going wrong again. These are great but just like you wouldn't have 100% regression tests, I think it's also smart to try to get ahead of failures and have some common sense alerts defined.