by lucaspiller
4886 days ago
One thing I worry about with something like this is that if you hide the detection and fixing behind layers of automation, things that are actually broken (or at least buggy, as in my example below) might get missed. For example, say you have a simple service that needs to be restarted every ~2 weeks because of a memory leak. If a human is in charge of this, then after having to go in and restart the service a few times they'll know that something isn't right. If it's automated, though, the computer won't have this intuition. And what if the leak depends on the number of requests served and your traffic is sporadic, so the first restart comes after 2 weeks, the next after 3 days, and the next after a month?
Sometimes things fly under the radar, though, and that's where the dashboard-style "situational awareness" UIs really shine. Typically the people asking for a "dashboard" are executives who really shouldn't care, who only need regular briefings plus an occasional text from someone in operations warning them of a major customer impact. The people who benefit from them are engineers who browse around the system looking for trouble or simply satisfying their curiosity. "The Foo servers handle the Bar requests. I wonder what their typical CPU utilization is. I'll go check one of them... click click click. Whoa, that memory usage doesn't look good. I wonder if it's always like that. click WTF is this erratic sawtooth pattern? Do the other Foo servers have this, too? click click click Yeesh, somebody needs to fix that." That's the ideal case, anyway, if you have a rich UI that is good at presenting pages of data in context that can be understood at a glance, with quick navigation to related data. If the engineer clicks the "CPU utilization" button and gets back a line graph and a table of numbers, with no other context, then the UI is forcing the engineer to have tunnel vision. It should be dashboards all the way down, until the engineer starts running custom queries that the system doesn't know how to provide context for.
But yeah, the chronic restarting scenario should show up in reports and hopefully trigger an alert. I imagine that routine interventions (such as spinning up extra servers for load) and troubling interventions (such as restarting a service) are distinguished in reporting.
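To make that last point concrete, here is a rough sketch of what such a report could look like. Everything in it is an assumption for illustration: the `Intervention` record, the action names, the `TROUBLING` set, and the 90-day window and threshold of 3 are all made up, not taken from any real tool. It counts troubling interventions per service over a long window instead of looking for a fixed period, so an irregular "2 weeks, then 3 days, then a month" restart pattern still gets flagged.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical taxonomy: actions that suggest something is actually broken.
# Anything not listed here (e.g. "scale_up") is treated as routine.
TROUBLING = {"restart_service"}


@dataclass(frozen=True)
class Intervention:
    """One automated action taken against a service (illustrative record)."""
    service: str
    action: str
    at: datetime


def chronic_restarts(log, now, window=timedelta(days=90), threshold=3):
    """Return services with at least `threshold` troubling interventions
    inside the trailing `window`, sorted by name."""
    cutoff = now - window
    counts = Counter(
        entry.service
        for entry in log
        if entry.action in TROUBLING and entry.at >= cutoff
    )
    return sorted(name for name, n in counts.items() if n >= threshold)


# Example log: "foo" is restarted at irregular intervals; "bar" only scales up.
now = datetime(2024, 6, 1)
log = [
    Intervention("foo", "restart_service", now - timedelta(days=50)),
    Intervention("foo", "restart_service", now - timedelta(days=36)),
    Intervention("foo", "restart_service", now - timedelta(days=33)),
    Intervention("bar", "scale_up", now - timedelta(days=2)),
    Intervention("bar", "scale_up", now - timedelta(days=1)),
]
flagged = chronic_restarts(log, now)
```

A count over a window is about the crudest thing that could work. The useful part is keeping routine and troubling actions in separate buckets, so that a busy autoscaler doesn't drown out the service that keeps getting quietly restarted.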