by graycat 3664 days ago
Yup.

Scaling? Can scale about as much as you want. Get some racks of high end servers and a fast server farm LAN, collect the data with whatever instrumentation, and let it run.

For false alarm rate, just shovel in the training data, pick a false alarm rate, say, one a week or one a month, and let it run. Don't hand tune anything. In effect, the hand tuning is replaced by the adjustable false alarm rate: you know the rate in advance and, if the statistical assumptions hold, get that rate exactly in practice. Statistical hypothesis testing has had adjustable false alarm rates for 100+ years. Sorry that computer monitoring has been struggling with that.
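To make "pick a false alarm rate" concrete, here is a minimal sketch, not the method from my paper: it turns a target like one false alarm a week into a per-test probability and then reads a threshold off the training data as a plain empirical quantile. The function names, the one-test-per-minute schedule, and the idea of a single numeric "score" per observation are all assumptions for illustration.

```python
import math

def per_test_alpha(false_alarms_per_period, tests_per_period):
    """Per-observation false alarm probability for a target alarm count per period."""
    return false_alarms_per_period / tests_per_period

def threshold_from_training(training_scores, alpha):
    """Return a threshold such that at most a fraction alpha of the training
    scores lie strictly above it. Raise an alarm when a new score > threshold."""
    ordered = sorted(training_scores)
    n = len(ordered)
    # Number of training scores allowed above the threshold (keep at least one below).
    allowed = min(math.floor(alpha * n), n - 1)
    return ordered[n - allowed - 1]

# Hypothetical schedule: one test per minute, target of one false alarm a week.
alpha = per_test_alpha(1, 7 * 24 * 60)
```

Note the catch: to resolve a rate as small as one in 10,080 tests, you need at least that many training observations. Three months at one per minute is about 129,600, so that schedule would be enough.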

Yes, you will need to make a judgment call about the assumption that the system now is statistically the same as during the past, say, three months of training or historical data of apparently healthy behavior.

For rates of false alarms, rates of missed detections of real problems -- the server farm bridge staff and the network operations center (NOC) can understand those. I was invited for a free lunch and gave a presentation to the operations staff at the main NASDAQ facility in Trumbull, CT, and the operations staff, maybe 30 people, understood the basics right away.

For the math, the only tricky part, and the core of my paper, is how the heck to know and adjust the false alarm rate, but operationally for the staff that is just trivially easy.

Once I was at Morgan Stanley and talking to their main Unix system administrator. He'd just come from a meeting on how the heck to monitor his Unix systems, and I explained my work. He nearly jumped up and exclaimed "We can use this right away!". But they didn't give me an offer, ask me to consult (my paper was not published yet), etc. So, really, they didn't much care.

For how to report the alarms, I'd guess feed into some standard system management infrastructure, consoles, whatever from HP, CA, EMC, Microsoft, etc. There's a way to get a real time, running strip chart that shows, for the data just observed, the false alarm rate you would have to select for that data to raise an alarm. If people want to watch 500 of those, okay by me. But mostly you would want a way to display the strip chart for detectors that just gave alarms or, given an alarm for, say, a Cisco switch, display the strip charts for all the monitors of that switch -- for insight and an aid to diagnosis.
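One simple way to get a number like that for the strip chart is an empirical p-value: the fraction of healthy training scores at least as extreme as the one just observed. This is a sketch under assumptions, not necessarily what my paper does; `empirical_p_value` and the idea of a single numeric score per observation are hypothetical names for illustration.

```python
from bisect import bisect_left

def empirical_p_value(sorted_training, score):
    """Smallest false alarm rate at which `score` would count as an alarm,
    estimated from training scores sorted in ascending order."""
    n = len(sorted_training)
    # Count of training scores >= the observed score.
    at_least = n - bisect_left(sorted_training, score)
    # The +1 terms count the new observation itself, so the value is never 0.
    return (at_least + 1) / (n + 1)

# Hypothetical healthy history and a few new observations to chart.
training = sorted(range(1, 100))
strip_chart = [empirical_p_value(training, s) for s in (50, 90, 100)]
```

Small values on the chart mean the observation would be an alarm even at a very strict false alarm rate; values near 1 mean business as usual.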

> The simplest example is trying to "baseline" CPU usage. CPU usage without something trivial like comparing to run-queue is stupid.

Of course it is. My work would still give the selected rate of false alarms, but the detector would likely also have poor detection rate. I.e., with just CPU usage, the poor detector just doesn't have enough information to do anything very good.

Now THAT'S in part why you want to be multi-dimensional. So, say, feed in PAIRS of CPU busy and run-queue length. Maybe include some more variables, e.g., time of day.
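To show why pairs help, here is a sketch that scores a (CPU busy, run-queue length) pair by its distance to the k-th nearest healthy training pair. To be clear, nearest-neighbor distance is a stand-in I'm using for illustration, not the detector from my paper, and the data is made up. It also assumes the variables are already on comparable scales; in practice you would need to rescale them.

```python
import math

def knn_score(training, point, k=1):
    """Distance from `point` to its k-th nearest training point.
    Larger means less like anything seen during healthy operation."""
    distances = sorted(math.dist(point, t) for t in training)
    return distances[k - 1]

# Made-up healthy history: run-queue length grows with CPU busy (percent).
training = [(cpu, cpu / 10) for cpu in range(0, 101, 10)]

busy_and_queued = knn_score(training, (90, 9))  # typical pair
idle_but_queued = knn_score(training, (20, 9))  # each value alone looks normal
```

Here CPU busy of 20 and a run queue of 9 are each inside the range seen in training, so one-variable detectors would stay quiet, but the pair is far from every healthy pair and gets a large score. That score can then be given a threshold for whatever false alarm rate you selected.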

So, here's your first judgment call: what to monitor, and which variables to combine into a group to feed to any one detector. It's clear you have some insight. Good. In time there will be some good ideas for what variables to use to monitor a Cisco switch, an Oracle database, a Windows Server, etc.

I have in mind some more research to help make such selections, but, again, for 20+ years no one was interested. You described the problems well, and I made good progress on solving them, but, still, no one was interested. No one. Did I mention, no one? The paper was right there in a peer-reviewed journal, and it was treated like a source of leprosy. Not my fault. And this is not nearly the first time I've publicized this on HN.

Really, there's hardly a well known VC firm that hasn't heard from me. And there's hardly a one I ever heard back from. What is it, you can lead a horse to water, but you can't make him drink?

Next issue, you already know about: Given a detection, the staff will want a diagnosis of the cause, then the root cause, and then the fix. Well, right, given some topology or some such of the detectors, could do some root cause analysis. But, in practice, diagnosis can be difficult. To ease the work of diagnosis, in each detector try to use some variables that, given an alarm, do give a hint about the cause and diagnosis. Or, have several detectors monitoring one server and, considering them jointly, that is, which ones just gave an alarm and which ones didn't, get some hints on cause -- right, could do more useful research here (I mentioned that).
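A crude sketch of the "consider them jointly" idea: record which detectors on a server just alarmed as a signature, and look that signature up in a table of patterns seen before. Everything here is hypothetical, the detector names, the table, and the causes in it, and a real table would have to come from the staff's own history of diagnosed problems.

```python
def alarm_signature(alarms):
    """Set of detector names that just gave an alarm."""
    return frozenset(name for name, fired in alarms.items() if fired)

def lookup_hints(signature, known_patterns):
    """Causes previously associated with exactly this alarm pattern, if any."""
    return known_patterns.get(signature, [])

# Hypothetical table built from past, already diagnosed incidents.
known_patterns = {
    frozenset({"cpu_runqueue"}): ["runaway process"],
    frozenset({"cpu_runqueue", "disk_io"}): ["heavy paging"],
}

# Hypothetical current state of the detectors on one server.
alarms = {"cpu_runqueue": True, "disk_io": True, "network": False}
hints = lookup_hints(alarm_signature(alarms), known_patterns)
```

An unseen pattern returns an empty list, which is the honest answer: hints only for what has been seen before.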

The only VC that called me back wanted not just my nicely automated detection but also nicely clear diagnosis and, no doubt, correction. Maybe he also wanted me to give Godzilla a bath, manicure, and rub down, too -- no problem, guys! Godzilla bath coming right up with release 2.0 and the Gold Enterprise Edition! I told the VC that anyone who promised to do a good job automating diagnosis in a real and large server farm or network was, uh, exaggerating what they could do and to stay far away.

Typically diagnosis takes a lot of information about the system being monitored. To do really well at diagnosis, you likely need to have already seen all the causes and their symptoms. While we should collect data and make progress where we can, that is, for the more common problems, in general we can't easily do diagnosis well.

Any questions?