|
|
|
|
|
by ahakanbaba
3233 days ago
|
|
I absolutely agree. For a service with any availability guarantees there has to be rigorous monitoring and alerting. This also holds for services that have internal clients. In other words, if your output is consumed only by other services in the same company, the same high monitoring standards must apply. Otherwise failure detection becomes very delayed and the productivity of many teams gets affected. There is no worse buzzkill than explaining other service owners what is wrong with their application. One other important lesson we have earned is that alerts require time to mature. The thresholds need to be trained, the alert formulation needs to be revised. Our alerts usually give couple of false positives in the first two weeks of their creation. During these two weeks we frequently improve the conditions of alerts. |
|