
Automated interventions should be tracked as data in the system, so you can set thresholds and create alerts and reports on them just like you can on anything else.

Sometimes things fly under the radar, though, and that's where the dashboard-style "situational awareness" UIs really shine. Typically the people asking for a "dashboard" are executives who really shouldn't care, who only need regular briefings plus an occasional text from someone in operations warning them of a major customer impact. The people who actually benefit from dashboards are engineers who browse around the system looking for trouble or simply satisfying their curiosity. "The Foo servers handle the Bar requests. I wonder what their typical CPU utilization is. I'll go check one of them... click click click. Whoa, that memory usage doesn't look good. I wonder if it's always like that. Click. WTF is this erratic sawtooth pattern? Do the other Foo servers have this, too? Click click click. Yeesh, somebody needs to fix that."

That's the ideal case, anyway, if you have a rich UI that is good at presenting pages of data in context that can be understood at a glance, with quick navigation to related data. If the engineer clicks the "CPU utilization" button and gets back a line graph and a table of numbers, with no other context, then the UI is forcing the engineer into tunnel vision. It should be dashboards all the way down, until the engineer starts running custom queries that the system doesn't know how to provide context for.

But yeah, the chronic restarting scenario should show up in reports and hopefully trigger an alert. I imagine that routine interventions (such as spinning up extra servers for load) and troubling interventions (such as restarting a service) are distinguished in reporting.
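
To make that concrete, here's a rough sketch of what "interventions as data" might look like. Everything in it is made up for illustration: the `Intervention` record, the action names, the `TROUBLING_ACTIONS` set, and the one-hour/three-restart threshold are all assumptions, not anything a particular monitoring system provides.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical classification: which actions count as "troubling" is an
# assumption; everything else is treated as routine.
TROUBLING_ACTIONS = {"restart_service"}


@dataclass(frozen=True)
class Intervention:
    """One automated action, recorded like any other data point."""
    timestamp: datetime
    target: str   # e.g. a server or service name
    action: str   # e.g. "restart_service", "scale_up"

    @property
    def troubling(self) -> bool:
        return self.action in TROUBLING_ACTIONS


def chronic_targets(events, now, window=timedelta(hours=1), threshold=3):
    """Return targets with at least `threshold` troubling interventions
    inside the trailing `window` ending at `now`."""
    cutoff = now - window
    counts = Counter(
        e.target
        for e in events
        if e.troubling and cutoff <= e.timestamp <= now
    )
    return sorted(t for t, n in counts.items() if n >= threshold)


now = datetime(2024, 1, 1, 12, 0)
events = [
    # foo-1 keeps getting restarted: this should trip the alert.
    Intervention(now - timedelta(minutes=50), "foo-1", "restart_service"),
    Intervention(now - timedelta(minutes=30), "foo-1", "restart_service"),
    Intervention(now - timedelta(minutes=5), "foo-1", "restart_service"),
    # foo-2 is just scaling for load: routine, so it is ignored here.
    Intervention(now - timedelta(minutes=20), "foo-2", "scale_up"),
    Intervention(now - timedelta(minutes=10), "foo-2", "scale_up"),
    Intervention(now - timedelta(minutes=1), "foo-2", "scale_up"),
    # foo-3 was restarted once, hours ago: outside the window.
    Intervention(now - timedelta(hours=3), "foo-3", "restart_service"),
]

print(chronic_targets(events, now))  # ['foo-1']
```

The point is only that once restarts and scale-ups are ordinary records, the "chronic restarting" alert is just another threshold query, and the routine/troubling split is a field you can filter or group on in reports.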