by nrr
664 days ago
One thing to be aware of is that up/down alerting bakes downtime into the incident detection and response process: by the time the alert fires, you're already down. So literally anything you can do to get away from that will help. A lot of the details are pretty application-specific, but the metrics I care about can be broadly classified as "pressure" metrics: CPU pressure, memory pressure, I/O pressure, network pressure, etc. Something that's "overpressure" can manifest as, e.g., excessive paging in and out, a lot of processes/threads stuck in "defunct" state, DNS resolutions failing, and so on. I don't have much of an opinion about push versus pull metrics collection as long as it doesn't melt my switches. They both have their place. (That said, programmable aggregation on the metrics exporter is nice to have.)

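To make "pressure" concrete: on Linux, one place to get numbers like this is the kernel's pressure stall information (PSI) under /proc/pressure/, assuming kernel 4.20 or later with PSI enabled. This is only a sketch of the idea, not necessarily what the comment above has in mind. The helper names and the 10% threshold are made up for illustration, and PSI covers CPU, memory, and I/O only, so network pressure would need a different source.

```python
from pathlib import Path


def parse_psi(text):
    """Parse the contents of a /proc/pressure/* file into nested dicts.

    Each line looks like:
        some avg10=0.00 avg60=0.00 avg300=0.00 total=0
    """
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        result[kind] = {
            key: float(value)
            for key, value in (field.split("=", 1) for field in fields)
        }
    return result


def read_pressure(resource):
    """Read PSI for one resource: "cpu", "memory", or "io"."""
    return parse_psi(Path("/proc/pressure", resource).read_text())


def over_pressure(psi, threshold=10.0):
    """True if some task was stalled more than `threshold` percent of the
    last 10 seconds. The threshold is an arbitrary illustrative value."""
    return psi["some"]["avg10"] > threshold


if __name__ == "__main__":
    for resource in ("cpu", "memory", "io"):
        try:
            psi = read_pressure(resource)
        except OSError:
            # PSI not available (older kernel, non-Linux, or disabled).
            continue
        print(resource, psi["some"]["avg10"], over_pressure(psi))
```

Alerting on a rising `avg10` or `avg60` rather than on a failed health check is the point: the stall shows up before the service is actually down.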
But saturation is not the same as errors.