by thinkersilver
2542 days ago
I'm not sure how much this discussion still happens, but I remember having these conversations more often a few years ago, when tools like Prometheus were emerging. This is my take and shouldn't be taken as gospel, but I've observed that metrics are perfect for isolating and identifying an issue, while logs attempt to explain the behaviour seen. A good place to start is a layered dashboard, laid out top to bottom, showing the throughput, error rates and latencies that express the health of each layer, from the exposed service or API all the way down to the subsystems and hardware supporting them. I'd write more but don't want to make this post overly long. There are several useful articles online on the different methodologies and approaches to achieve this.

There isn't much conversation, though, about the knowledge an organisation has around incident analysis workflows: how past incident resolutions are captured and integrated with dashboards in monitoring systems, and how they are shared and reused as SREs and engineers work with log analytics. I've seen the same lessons having to be relearned when new engineers join a team. Checklists are a good example (there are other ways) of directing incident analysis in large, complex systems, but how often do you see this supported by the monitoring system? The focus is always on displaying pretty charts in a grid, when more value could be gleaned from the same dashboard content presented as a rich living document, with charts and integrated checklists, in something resembling a guide to resolving an issue. There is certainly enough data to achieve this.
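
To make the layered idea a bit more concrete, here is a rough sketch in Python of a dashboard rendered as a guide rather than a grid. Everything in it is an assumption for illustration only: the `LayerHealth` type, the threshold values and the checklist steps are hypothetical and don't come from Prometheus or any other monitoring tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LayerHealth:
    """Health summary for one layer of the stack (hypothetical structure)."""
    name: str
    throughput_rps: float
    error_rate: float          # fraction of failed requests, 0.0 to 1.0
    p99_latency_ms: float
    checklist: List[str] = field(default_factory=list)  # steps from past incidents


# Illustrative thresholds; real ones would be tuned per service.
MAX_ERROR_RATE = 0.01
MAX_P99_LATENCY_MS = 500.0


def is_unhealthy(layer: LayerHealth) -> bool:
    """A layer is flagged when errors or tail latency exceed the thresholds."""
    return (layer.error_rate > MAX_ERROR_RATE
            or layer.p99_latency_ms > MAX_P99_LATENCY_MS)


def unhealthy_layers(layers: List[LayerHealth]) -> List[str]:
    """Names of flagged layers, in the same top-to-bottom order as the input."""
    return [layer.name for layer in layers if is_unhealthy(layer)]


def render_guide(layers: List[LayerHealth]) -> str:
    """Render the layers top to bottom, attaching the checklist to flagged ones."""
    lines = []
    for layer in layers:
        status = "UNHEALTHY" if is_unhealthy(layer) else "ok"
        lines.append(
            f"{layer.name}: {status} "
            f"({layer.throughput_rps:.0f} req/s, "
            f"{layer.error_rate:.1%} errors, "
            f"p99 {layer.p99_latency_ms:.0f} ms)"
        )
        if is_unhealthy(layer):
            lines.extend(f"  [ ] {step}" for step in layer.checklist)
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up sample data, ordered from the exposed API down to storage.
    stack = [
        LayerHealth("api", 1200, 0.002, 180, ["Check recent deploys"]),
        LayerHealth("database", 300, 0.05, 900,
                    ["Check slow query log", "Check disk usage"]),
    ]
    print(render_guide(stack))
```

The point of the sketch is only that the metrics and the checklist live in the same structure, so the steps learned from a past incident show up next to the layer that is misbehaving.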
Do you have some of these articles in mind?