|
Part I Here at HN I'm trying to be anonymous but
on anomaly detection make a contribution. The basic idea of my research is for
positive integer k, use k-nearest
neighbors or, really, any of a wide
variety of such measures, metrics,
functions, windows however want to
regard them. So, right, are on the way to some
approaches to multi-dimensional
probability density estimation. The main contribution of my research is
the applied probability derivations to
show how to calculate, adjust, set, etc.
the false alarm rate. In practice, false
alarm rate unknown, out'a control, often
too high is a biggie problem -- e.g., can
cause staff just to ignore too many
alarms. E.g., see the movie Tora, Tora,
Tora about the 12/7/1941 attack on Pearl
Harbor where the US actually had just what
the heck it needed for early detection,
early enough to get fighter airplanes in
the air, maybe early enough to get the
ships out of the harbor, etc. -- radar on
top of a hill. And the radar operators
DID get the detection and DID report it,
but the chain of command assumed that the
detection was a false alarm. Big, huge,
gigantic bummer. The US came within the
thickness of an onion skin of getting the
US airplanes and ships out of harm's way,
shooting down much of the attacking
airplanes, and having the US carriers,
already at sea, attacking the attacking
carriers when all their airplanes were off
over Hawaii. The US could have sustained
little damage and sunk all six of the
attacking carriers. Having a good anomaly detector and
ignoring it can be a bummer. But if know what the false alarm rate is
and know that it's low, then maybe the
chain of command will take a detection
more seriously and, then, take the next
step -- diagnosis. Sometimes the spook business, say, Soviets
in England, would watch the comings,
goings, lights on late, etc. at British
military HQs and try to detect anomalies
that would indicate war. Sure, for such
detection need both distribution-free and
multi-dimensional with good control over
false alarm rate and good evidence of a
relatively high detection rate. So, sure,
the US CIA should want my work. I wrote
them -- never heard back. Once I did give a talk on my work at the
Institute for Defense Analysis (IDA),
think tank resource for the Joint Chiefs.
Some in the audience mentioned credit card
fraud detection. Okay. But there should
also be national security applications. Once gave a talk on the work at the Johns
Hopkins University Applied Physics Lab
(JHU/APL) -- I might as well have talked
about a recipe for pancakes. Once I gave a talk on the work at the
NASDAQ site in Trumbull, CT. I got a nice
tour, a nice lunch, and a nice tote bag. I gave some talks on this work to my
management chain at IBM's Watson lab --
the suits were afraid I might have some
good work they didn't understand, couldn't
control, and couldn't throttle. They were
PISSED. I presented my work to some IBM people
outside of Watson Research -- they liked
the work, but Watson Research was PISSED. For the applied probability stuff in my
research, for a detection of an anomaly
can also report the lowest false alarm
rate at which the data would still be a
detection. Then right along can graph
these lowest rates and use them as a
heuristic about detection seriousness.
Net, getting a handle, knob to turn,
knowledge of, etc. false alarm rate is
important. The applied probability stuff is based on
some cute symmetry in the calculations.
The symmetry, then, gives some measure
preserving transformations
(transformations that don't change the
probability distribution). So, end up
with a finite algebraic group of such
transformations. Then sum over the group
and, presto, bingo, get what need to know
false alarm rate. To adjust the false alarm rate, adjust the
diameter, cut-off, whatever, of the k
nearest neighbors or other metric,
convolution window, etc. Of course all of
this is conditional probabilities
conditioned on the historical data used
where assume that the system being
monitored when the historical data was
collected was healthy. How to confirm
healthy? Let the data age and see if
there are any other, later symptoms of
problems from other means, e.g., angry
customer complaints, malware, attacks and
intrusions, etc. So, have a case of, essentially, being
brief here, the conditional expectation
E[Y|X] where Y is the random variable of
detections and X is the historical data.
Then of course E[E[Y|X]] = E[Y] which says
that have set false alarm rate "exactly"
in the long-run or, if you will,
conditionally exactly in the short run. A result is that the naive way to evaluate
false alarm rate from the historical data
does, for lots of data, converge to the
right answer. Proving that, however, was
a bit tricky. Achieving the Neyman-Pearson best result
needs lots of data from both healthy and
sick systems, and in practice the second
is in short supply. The Neyman-Pearson result is intuitive,
like investing in real estate: Buy plots
of real estate that give the best return
per dollar invested until have invested
all the available money. Well, suppose the data observed is, for
the set of real numbers R and some
positive integer n, in R^n. Then
partition R^n into little patches. Then
in each little patch, assume that data in
that patch would raise an alarm, and,
then, from the data (need lots of data,
especially as n grows) get the ratio of
the detection rate over the false alarm
rate. Then false alarm rate is really
like the money you have to invest, and
detection rate is your return on that
investment. So, sure, you buy the
patches with the highest ratio until you
have spent, invested, all your false
alarm rate, money. Can find proofs in terms of freshman
calculus. I wrote out a quite general
proof in terms of the Hahn decomposition
from the Radon-Nikodym theorem in measure
theory. So, right, if have only finitely many
patches in R^n, then have essentially a
knapsack problem, IIRC, in NP-complete
(although likely not a concern in
practice). Can write out a little derivation, mostly
just Fubini's theorem (the measure theory
version of the calculus interchange order
of integration) to show in a cute sense of
the distribution of sick systems being
translation invariant (right, a bit much
to assume -- but here I'm typing quickly
from memory of what I did 20+ years ago)
then my techniques give, for the assumed
false alarm rate, the highest detection
rate possible. So, this is a simple,
quick and dirty, data-free, crude approach
to the full Neyman-Pearson result of best
detection. |
There is another cute point: The usual way to use the historical data has its order ignored. So, should be able to get some cute theorems about this that permit assuming much less than that the input data is independent and identically distributed (i.i.d.). That is, get to assume that the historical data has been permuted randomly, and, I'm guessing that, from this could get a cute approximation result. One start would be the work of Michel Talagrand on "a new look at independence". Why have I not pursued this research? Did I mention that I have 20+ years of evidence that anomaly detection and a dime won't cover a 10 cent cup of coffee?
Then, with the math clear, the data collected, etc., need to write some code. I put my feet up and thought of a way. Soon I learned that part of my work reinvented k-D trees, that is, for positive integer k, search trees in k-dimensions. IIRC, k-D trees are in
Robert Sedgewick and Kevin Wayne, Algorithms, FOURTH EDITION, Addison-Wesley, New York, 2011.
So, k-D trees are a lot like a k dimensional generalization of the usual one dimensional binary search.
But, there, also need some cutting planes and a little backtracking in the tree. For this, for the computer hardware in serious production, the new several TB solid state disk (SSD) drives would be just terrific: Load up one of those with a k-D tree of historical data, and then in production deployment many times a second query that data. Since are using the data -- write, say, once a week or month and read hundreds or thousands of times a second -- the SSD would give fantastically fast data rates and not wear out.
Of course, more could be done:
(1) How much historical data is really needed?
(2) Of the data on several variables, which are needed? Or, maybe we should prove some theorems that show when too many variables without enough historical data hurt the results?
(3) Should we think about scaling the data on some of the variables? If so, then how to know what variables and how much scaling? Could want some useful theorems here.
(4) For a system level view, could we be hierarchical, that is, say that this one server is sick, with another detector drill-down and say that this one virtual machine on that server is sick, drill down and say that this one applications program is sick, or some such? In all cases? No. In some cases, maybe?
(5) After detection with a low false alarm rate, the next step is diagnosis. But detectors that in some useful sense localize the source of the sickness should be able to help with diagnosis and, indeed, the third step, correction. So, a question is, at what scales and where to deploy detectors? Could use some useful theorems, analysis, etc. here.
(6) Networks and server farms are changing constantly, but need some good historical data that still describes a healthy system. Okay, could use some work to show when the changes have been enough to need new historical data.
(7) Maybe of high interest to the suits in the C-suite, do some on decision theory, that is, essentially cost minimization. So, have costs for false alarms and costs for missed detections and try to set the false alarm rate, and/or the number of detectors, etc., to minimize the sum of all these costs.
Since a missed detection, i.e., a problem detected too late instead of ASAP, can be headlines for IIRC Sony, Target, the NYSE, Yahoo, etc., the suits might be eager for relatively a lot of such monitoring.
(8) My work was for problems never seen before. The 50,000 foot view of that is that, once have seen a problem, detected it, diagnosed it, and corrected it, then, with the corrections, really shouldn't see the problem again. So, really what we should be looking for are problems never seen before and should slap our own wrists for any problem we keep seeing over and over.
Given some e-mail addresses, I'll send a reference to my published paper. But I'm trying to be anonymous here at HN.
Again, I'm no longer motivated by money to pursue anomaly detection: For 20+ years I got overwhelming evidence that work in anomaly detection and a dime won't cover a 10 cent cup of coffee. I tried to make a startup out of this work, but gotta tell you, after hundreds of e-mail messages to venture capital firms, about 98% said nothing back, one gave a weak reply with no indication of any real understanding of the market, technical challenges, my work, what such a company would look like, etc., and the remaining 2%, or maybe 1%, gave only some boiler plate about "not in our focus area".
So, I have another startup in progress, easier to do, easier to sell, much more promising and close to going live and making money.
So, here I'm willing to do an anomaly detection technology give away! Get your free anomaly detection tutorial and technology!
Uh, when I did the research at IBM's Watson lab, the Watson lab patent office was getting excited by my work -- "there are several projects in the lab attacking anomaly detection, and your work is the only one taking your approach.".
I had my research in a nicely written paper, complete with theorems and proofs and some good tests on some real data (from a computer cluster at Allstate -- the cluster was part of the motivation to be multi-dimensional, e.g., to detect some of the failures due to chain reactions in clustering) and also some hypothetical data that would make a severe test (the checkerboard example is a simplification), and a guy at IBM claimed to have my work "reviewed" and pronounced the work as "not publishable," and I got fired, walked out the door. Not good. He didn't give me a chance to submit my work for publication.
But, out of IBM, with a copy of my paper and a letter from IBM permitting me to publish, I submitted the paper, and it was accepted without revision (except something about indenting the first paragraph after a subheading!) by the first journal where I submitted, Information Sciences, a good enough journal.
I've been burned enough pursuing anomaly detection. I don't like being burned; it's no fun.
My conclusion is that any very good work in the field is just way too darned difficult for essentially everyone else that would be involved -- management chain, journal reviewers, computer science departments, venture capital partners, suits at target customers, etc.
Right, as suggested elsewhere in this thread, try to sell a solution or a service, maybe cloud based. But setting that up and getting it to good traction would take some cash, more than I'm going to put forward.