by tristanm 2577 days ago
This is a good illustration of the core problem we have in "anomaly detection" in data science. Often we are presented with a challenge that, if solved, would negate the presence of the challenge itself: we have to look for events that aren't explained or predicted by our current understanding of the given system. To find them, we collect all events and evaluate their likelihood under our best model, taking the least likely as our "anomalous" events. Then, once found, we have to explain them. But explaining them requires that we understand the system well enough to predict the existence of those events. If we did, we could have produced a better model, and that model would have rated those events as more likely, so they wouldn't have shown up. This contradiction seems to be inherent to the whole concept of anomaly detection.
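To make the procedure concrete, here is a minimal sketch of "score every event under the model and take the least likely." It assumes the "best model" is a single Gaussian fitted to the data, which is purely an illustrative choice; the function name `least_likely` and the sample values are made up for the example.

```python
import math
import statistics

def least_likely(events, k=2):
    """Return the k events with the lowest log-likelihood under a fitted Gaussian."""
    # Assumption: the "best model" is one Gaussian fitted to all events.
    mu = statistics.fmean(events)
    sigma = statistics.stdev(events)

    def log_likelihood(x):
        return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)

    # Sorting ascending puts the least likely events first.
    return sorted(events, key=log_likelihood)[:k]

# Hypothetical measurements: mostly near 10, with two values far from the rest.
events = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 14.5, 9.7, 5.2, 10.1]
anomalies = least_likely(events, k=2)
print(anomalies)
```

The ranking only says which events the model finds surprising; it says nothing about why, which is exactly the gap described above.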

It's not a contradiction. The anomalous events tell you to improve your model, so while your model yesterday was insufficient, your model tomorrow will not be. If you're wondering why yesterday's model wasn't already the best possible, it's because you make guesses about what's important and what's not, and those guesses are refined by correcting your model in the presence of anomalous events.
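A rough sketch of that refinement loop, under invented assumptions: each event is an `(is_weekend, value)` pair, model v1 is one Gaussian over all values, and model v2 is a Gaussian per `is_weekend` group. The data and the names `rank_v1` and `rank_v2` are hypothetical, chosen only to show flagged events disappearing once the model explains them.

```python
import math
import statistics

def gaussian_logpdf(x, mu, sigma):
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)

def rank_v1(events):
    """Model v1: a single Gaussian over all values, ignoring the weekend flag."""
    values = [v for _, v in events]
    mu, sigma = statistics.fmean(values), statistics.stdev(values)
    return sorted(events, key=lambda e: gaussian_logpdf(e[1], mu, sigma))

def rank_v2(events):
    """Model v2: one Gaussian per group, after learning that weekends differ."""
    params = {}
    for flag in {f for f, _ in events}:
        group = [v for f, v in events if f == flag]
        params[flag] = (statistics.fmean(group), statistics.stdev(group))
    return sorted(events, key=lambda e: gaussian_logpdf(e[1], *params[e[0]]))

# Hypothetical data: (is_weekend, value).
weekday = [(False, v) for v in (100, 102, 98, 101, 99, 100, 103, 97, 108)]
weekend = [(True, v) for v in (40, 42, 38, 41)]
events = weekday + weekend

flagged_v1 = rank_v1(events)[:4]  # least likely under the old model
flagged_v2 = rank_v2(events)[:1]  # least likely under the refined model
print(flagged_v1)
print(flagged_v2)
```

Under v1 the least likely events are all weekend readings. Once the weekend effect is part of the model, they stop standing out and a different event takes their place at the bottom of the ranking.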
If you start with a weak model that doesn't contain all the knowledge you have available, your anomalies will contain many irrelevant or already known things. If you start with a strong model representing the best current understanding, then correcting the model is not so straightforward.
Suggesting that any model that’s not 100% consistent with all known information is weak misses the point. Models that can be automated in a reasonable timeframe on limited hardware beat those that can’t.

The goal is to find interesting things in the data, not simply to take years of data and return “everything looks normal.”