Hacker News new | ask | show | jobs
Show HN: Midas, a Streaming Anomaly Detector. Now Implemented in Go (github.com)
88 points by steve0hh 2245 days ago
4 comments

Neither title, comments here nor Github repo indicated to an outsider like me what this is or what field it's even from. The abstract of the linked paper helps though:

"Given a stream of graph edges from a dynamic graph, how can we assign anomaly scores to edges in an online manner, for the purpose of detecting unusual behavior, using constant time and memory? Existing approaches aim to detect individually surprising edges. In this work, we propose MIDAS, which focuses on detecting microcluster anomalies, or suddenly arriving groups of suspiciously similar edges, such as lockstep behavior, including denial of service attacks in network traffic data. MIDAS has the following properties: (a) it detects microcluster anomalies while providing theoretical guarantees about its false positive probability; (b) it is online, thus processing each edge in constant time and constant memory, and also processes the data 162−644 times faster than state-of-the-art approaches; (c) it provides 42%-48% higher accuracy (in terms of AUC) than state-of-the-art approaches."

Thanks for your comments.

I know the documentation and examples is still a bit raw, will be working on it.

When doing the implementation, I was trying to mimic the original API(https://github.com/bhatiasiddharth/MIDAS) that was done in the C++ first.

This was in case someone wants to use the exact same API(for familiarity) in Go, they could. Posted this here to show that there is now a Go implementation of it. :)

After this, I'll be implementing a fit/predict api that mimics SKLearn's API and showing more examples on how to use it in a streaming fashion.

Thanks. I'd love some ELI5s on this.

My last gig had some newly minted CS graduates who proposed complimenting our monitoring with some anomaly detection.

Having done a bit of both data mining and optimization, a lifetime ago, I could kinda follow the big data and machine learning stuff the kids were doing.

Whereas anomaly detection and fault prediction seem magical to me.

There is a recorded presentation of the paper at https://youtu.be/Bd4PyLCHrto The first 5-10 minutes or so should be quite explanatory. Please feel free to let me know if you have any specific doubts. Thanks.
What are some interesting applications of edge anomaly detection?
Detecting Intrusions, Denial of Service (DoS) attacks, Distributed Denial of Service (DDoS) attacks. It can also be used to detect fake profiles in Social Networks like Twitter, Facebook, Amazon reviews, and Financial Frauds. Basically any suspicious similar group of edges in time-evolving/dynamic graphs.
Code is quite neat! What are the changes it will need for including fit and predict API?
I've just added the fit/predict api for Midas. Will be doing MidasR next when I've time. :)

The change needed is basically to create a MirasR struct and re-purpose the MidasR function into fit and predict.

it's not clear to me how this API supports streaming edges. Looks like it takes lists of sources and destinations and loads them into memory.
Thanks for your comments. I'll be updating the readme to better reflect it..

I've just added the midas struct, in which you can do the following:

```

m := midas.NewMidasModel(2, 769, 9460)

m.FitPredict(2,3,1) // inputs will be your src,dst and time (from your stream)

m.FitPredict(2,3,1) // same here

```

The MidasR model is next. :)

Could you go into the selection criteria for those 3 numbers? I see 2 and 769 on the original MIDAS repo as default number of hash functions and buckets, but without explanation of why they're chosen. The third argument, "m", I don't see explained anywhere.
I adapted the code from the original paper from https://github.com/bhatiasiddharth/MIDAS/ .

And from one of the issues(https://github.com/bhatiasiddharth/MIDAS/issues/7#issuecomme...), M can be any reasonable value because it's just used to compute the edge hash (https://github.com/steve0hh/midas/blob/master/edgehash.go#L4...).

Hope it clarifies.

Hi, I'm the author of the MIDAS algorithm. We choose the number of hash functions and bucket according to the maximum error we can tolerate and the false positive probability theoretical guarantee we want. Please refer to the AAAI paper here: https://www.comp.nus.edu.sg/~sbhatia/assets/pdf/midas.pdf Let me know if you need more details.