
Part II

There is another cute point: The usual way of using the historical data ignores its order. So, should be able to get some cute theorems that permit assuming much less than that the input data is independent and identically distributed (i.i.d.). That is, get to assume only that the historical data has been permuted randomly, and I'm guessing that from this could get a cute approximation result. One start would be the work of Michel Talagrand on "a new look at independence". Why have I not pursued this research? Did I mention that I have 20+ years of evidence that anomaly detection and a dime won't cover a 10 cent cup of coffee?

Then, with the math clear, the data collected, etc., need to write some code. I put my feet up and thought of a way. Soon I learned that part of my work reinvented k-D trees, that is, for positive integer k, search trees in k-dimensions. IIRC, k-D trees are in

Robert Sedgewick and Kevin Wayne, Algorithms, FOURTH EDITION, Addison-Wesley, New York, 2011.

So, k-D trees are a lot like a k-dimensional generalization of the usual one-dimensional binary search.

But, there, also need some cutting planes and a little backtracking in the tree. For this, for the computer hardware in serious production, the new several TB solid state disk (SSD) drives would be just terrific: Load up one of those with a k-D tree of historical data, and then in production deployment query that data many times a second. Since the usage of the data is write, say, once a week or month and read hundreds or thousands of times a second, the SSD would give fantastically fast data rates and not wear out.
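For a rough idea of what "cutting planes and a little backtracking" can look like in a k-D tree, here is a minimal Python sketch of the standard textbook nearest-neighbor search. It is not the detector from the paper, just the generic data structure; the names `Node`, `build_kd_tree`, and `nearest` are made up for illustration, and the sample points are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

Point = Tuple[float, ...]


@dataclass
class Node:
    point: Point                    # the point stored at this node
    axis: int                       # coordinate used for the cutting plane here
    left: Optional["Node"] = None   # points below the plane on this axis
    right: Optional["Node"] = None  # points at or above the plane


def build_kd_tree(points: Sequence[Point], depth: int = 0) -> Optional[Node]:
    """Build a k-D tree by splitting at the median, cycling through the axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return Node(
        point=pts[mid],
        axis=axis,
        left=build_kd_tree(pts[:mid], depth + 1),
        right=build_kd_tree(pts[mid + 1:], depth + 1),
    )


def nearest(node: Optional[Node], query: Point, best=None):
    """Return (squared_distance, point) for the stored point closest to query."""
    if node is None:
        return best
    d2 = sum((a - b) ** 2 for a, b in zip(node.point, query))
    if best is None or d2 < best[0]:
        best = (d2, node.point)
    # Signed distance from the query to this node's cutting plane.
    diff = query[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, query, best)
    # Backtrack into the far side only if the plane is closer than the best so far.
    if diff * diff < best[0]:
        best = nearest(far, query, best)
    return best


# Hypothetical 2-dimensional "historical data" and one query point.
history = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kd_tree(history)
print(nearest(tree, (9, 2)))
```

The tree is built once and then only read, which matches the write-rarely, read-often usage described above. Sorting at every level is the simple way to build it, not the fastest.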

Of course, more could be done:

(1) How much historical data is really needed?

(2) Of the data on several variables, which are needed? Or, maybe we should prove some theorems that show when too many variables without enough historical data hurt the results?

(3) Should we think about scaling the data on some of the variables? If so, then how to know what variables and how much scaling? Could want some useful theorems here.

(4) For a system level view, could we be hierarchical, that is, say that this one server is sick, with another detector drill-down and say that this one virtual machine on that server is sick, drill down and say that this one applications program is sick, or some such? In all cases? No. In some cases, maybe?

(5) After detection with a low false alarm rate, the next step is diagnosis. But detectors that in some useful sense localize the source of the sickness should be able to help with diagnosis and, indeed, the third step, correction. So, a question is, at what scales and where to deploy detectors? Could use some useful theorems, analysis, etc. here.

(6) Networks and server farms are changing constantly, but need some good historical data that still describes a healthy system. Okay, could use some work to show when the changes have been enough to need new historical data.

(7) Maybe of high interest to the suits in the C-suite, do some work on decision theory, that is, essentially cost minimization. So, have costs for false alarms and costs for missed detections and try to set the false alarm rate, and/or the number of detectors, etc., to minimize the sum of all these costs.

Since a missed detection, i.e., a problem detected too late instead of ASAP, can be headlines for IIRC Sony, Target, the NYSE, Yahoo, etc., the suits might be eager for quite a lot of such monitoring.

(8) My work was for problems never seen before. The 50,000 foot view of that is that, once have seen a problem, detected it, diagnosed it, and corrected it, then, with the corrections, really shouldn't see the problem again. So, really what we should be looking for are problems never seen before and should slap our own wrists for any problem we keep seeing over and over.
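The cost minimization in point (7) can be sketched in a few lines of Python. Everything here is an assumption for illustration: the function names, the counts and dollar costs, and especially `toy_detect_prob`, a made-up curve for how the detection rate rises with the false alarm rate. A real curve would have to be measured for the detector in question.

```python
from typing import Callable, Iterable


def expected_cost(
    alpha: float,                          # false alarm rate per healthy check
    detect_prob: Callable[[float], float], # detection rate at that alpha
    n_healthy: float,                      # checks made on a healthy system
    n_sick: float,                         # real problems in the same period
    cost_false_alarm: float,
    cost_miss: float,
) -> float:
    """Expected false alarm cost plus expected missed detection cost."""
    false_alarms = alpha * n_healthy
    misses = (1.0 - detect_prob(alpha)) * n_sick
    return cost_false_alarm * false_alarms + cost_miss * misses


def best_false_alarm_rate(candidates: Iterable[float], **params) -> float:
    """Pick the candidate false alarm rate with the lowest expected cost."""
    return min(candidates, key=lambda a: expected_cost(a, **params))


def toy_detect_prob(alpha: float) -> float:
    # Hypothetical concave curve: more false alarms buy more detections.
    return alpha ** 0.25


# All numbers below are invented for the example.
params = dict(
    detect_prob=toy_detect_prob,
    n_healthy=1_000_000,
    n_sick=10,
    cost_false_alarm=50.0,
    cost_miss=1_000_000.0,
)
grid = [i / 10_000 for i in range(1, 1001)]  # 0.0001 through 0.1
best = best_false_alarm_rate(grid, **params)
print(best, expected_cost(best, **params))
```

A grid search is used because it makes no assumptions about the shape of the curve. With a measured curve, the same two functions apply unchanged.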

Given some e-mail addresses, I'll send a reference to my published paper. But I'm trying to be anonymous here at HN.

Again, I'm no longer motivated by money to pursue anomaly detection: For 20+ years I got overwhelming evidence that work in anomaly detection and a dime won't cover a 10 cent cup of coffee. I tried to make a startup out of this work, but gotta tell you, after hundreds of e-mail messages to venture capital firms, about 98% said nothing back, one gave a weak reply with no indication of any real understanding of the market, technical challenges, my work, what such a company would look like, etc., and the remaining 2%, or maybe 1%, gave only some boilerplate about "not in our focus area".

So, I have another startup in progress, easier to do, easier to sell, much more promising and close to going live and making money.

So, here I'm willing to do an anomaly detection technology give away! Get your free anomaly detection tutorial and technology!

Uh, when I did the research at IBM's Watson lab, the Watson lab patent office was getting excited by my work -- "there are several projects in the lab attacking anomaly detection, and your work is the only one taking your approach."

I had my research in a nicely written paper, complete with theorems and proofs and some good tests, both on some real data (from a computer cluster at Allstate -- the cluster was part of the motivation to be multi-dimensional, e.g., to detect some of the failures due to chain reactions in clustering) and on some hypothetical data that would make a severe test (the checkerboard example is a simplification). Then a guy at IBM claimed to have had my work "reviewed" and pronounced the work "not publishable," and I got fired, walked out the door. Not good. He didn't give me a chance to submit my work for publication.

But, out of IBM, with a copy of my paper and a letter from IBM permitting me to publish, I submitted the paper, and it was accepted without revision (except something about indenting the first paragraph after a subheading!) by the first journal where I submitted, Information Sciences, a good enough journal.

I've been burned enough pursuing anomaly detection. I don't like being burned; it's no fun.

My conclusion is that any very good work in the field is just way too darned difficult for essentially everyone else that would be involved -- management chain, journal reviewers, computer science departments, venture capital partners, suits at target customers, etc.

Right, as suggested elsewhere in this thread, try to sell a solution or a service, maybe cloud based. But setting that up and getting it to good traction would take some cash, more than I'm going to put forward.