Hacker News
by sp_ 3636 days ago
Great summary! By nature of my job (eng lead of a major mobile malware detection team) I have a lot of startups pitch their ML solutions to me. A couple of thoughts:

- There are no useful publicly available data sets for training. There are a few small ones and a few old ones, but they don't reflect the reality of 2016. Companies that pitch me solutions to the malware of 2012 are not useful.

- The majority of mobile malware is based on some kind of social engineering. On a code level these are indistinguishable from legitimate applications (the same APIs are used in the same fashion). The only difference is whether app behavior meets user expectations or not. Making this decision automatically seems intractable so far.

- Malware is not really a well-defined term. There is phishing, toll fraud, Trojans, privilege escalation exploits, ... If you generically look for malware, the signals you will look for are going to approach the complete set of APIs made available by your OS. Your results will just be a giant blob where everything is connected. Pick a single malware category and focus on just that at a time. ML signals for priv esc will look very different from those for phishing.

- ML is sexy. Malware analysis is not. Startups seem to hire too many ML people and not enough malware analysis people. I've had startups pitch to me that had literally zero people on staff who knew what mobile malware actually looked like. They just did anomaly detection and then tossed the results over to my team to verify them. That's not how it works. We're not your QA team. :)

2 comments

Hey, just finished a custom malware ML system for one of the largest European corporations, large enough that some malware is targeted at them. Result is 97% accuracy (they did retrain and check on their own held-out dataset). More careful analysis is needed (many malware samples have high-entropy 'zones' that may help the classifier find the right category), but overall it does work.

See the Microsoft / Kaggle challenge on classifying malware families; the winning solution is > 99% accuracy IIRC.

Can you describe a security setting where 97% accuracy is actually useful? Unless the events you're looking at are low volume, or you somehow have much more malicious data than everyone else, that seems like a recipe for your results being primarily FPs.
For context, a company can easily get ~1B security-related events a day, so even misreporting, say, 0.1% of those a day means some poor junior analyst has 1,000,000 tickets to slog through. If you expand that to full packet captures as suggested in the article... ouch.
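
To make the arithmetic above concrete, here is a minimal sketch of the base-rate calculation. The 1B events/day and the 0.1% error rate come from the comment; the count of truly malicious events (1,000 a day) and the 97% detection rate on them are assumptions picked purely for illustration, and `alert_load` is a hypothetical helper, not anyone's real tooling.

```python
def alert_load(events_per_day, malicious_per_day, false_positive_rate, true_positive_rate):
    """Return (false_alerts, true_alerts, precision) for one day of events.

    malicious_per_day and true_positive_rate are illustrative assumptions;
    real deployments would have to measure them.
    """
    benign_per_day = events_per_day - malicious_per_day
    # Benign events wrongly flagged as malicious.
    false_alerts = benign_per_day * false_positive_rate
    # Malicious events correctly flagged.
    true_alerts = malicious_per_day * true_positive_rate
    # Fraction of all alerts that are actually malicious.
    precision = true_alerts / (true_alerts + false_alerts)
    return false_alerts, true_alerts, precision


if __name__ == "__main__":
    false_alerts, true_alerts, precision = alert_load(
        events_per_day=1_000_000_000,  # ~1B events/day, from the comment
        malicious_per_day=1_000,       # assumed, for illustration only
        false_positive_rate=0.001,     # 0.1% misreported, from the comment
        true_positive_rate=0.97,       # assumed detection rate
    )
    print(f"false alerts/day: {false_alerts:,.0f}")
    print(f"true alerts/day:  {true_alerts:,.0f}")
    print(f"precision:        {precision:.4%}")
```

Under these assumptions, roughly a million false alerts a day bury fewer than a thousand real ones, so well under 1% of tickets point at anything malicious. The exact figures move with the assumed base rate, but the shape of the problem does not.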

(We do some cool visual analytics work here, including unsupervised learning / classification, and target more of the problem of "given an incident you're already investigating, what else should you now look at from across all your tools?")

We're talking hundreds of thousands of malware samples here.
The 99% means little when it suffers from a problem similar to the one the immune system has with cancer: the adversary is non-stationary, while the model is fixed.
That's what the research under the banners of security via diversity and "moving target" is addressing. I recall the Hydra firewall from Sentinel did that sort of thing. OpenBSD and grsecurity do it in OSS for parts of their OS. Such methods can be combined with these.
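
A toy sketch of the fixed-model-versus-moving-adversary point, using entirely synthetic numbers: a threshold is fitted once on made-up "suspiciousness" scores and then frozen, while each later generation of malware shifts its scores toward the benign range. The scores, the shift of 0.15 per generation, and the helper names are all assumptions for illustration, not a model of any real detector.

```python
def fit_threshold(benign_scores, malicious_scores):
    """Fit once: midpoint between the two class means. Never updated afterwards."""
    benign_mean = sum(benign_scores) / len(benign_scores)
    malicious_mean = sum(malicious_scores) / len(malicious_scores)
    return (benign_mean + malicious_mean) / 2


def detection_rate(threshold, malicious_scores):
    """Fraction of malicious samples whose score exceeds the frozen threshold."""
    caught = sum(1 for score in malicious_scores if score > threshold)
    return caught / len(malicious_scores)


# Synthetic training data (made up for this sketch).
benign = [0.1, 0.2, 0.3, 0.2]
malicious_at_training = [0.8, 0.9, 0.7, 0.85]
threshold = fit_threshold(benign, malicious_at_training)

# The adversary adapts: each generation looks a bit more like benign software.
generations = [
    [score - 0.15 * g for score in malicious_at_training] for g in range(4)
]
rates = [detection_rate(threshold, gen) for gen in generations]

if __name__ == "__main__":
    for g, rate in enumerate(rates):
        print(f"generation {g}: detection rate {rate:.0%}")
```

The detection rate is perfect on the data the threshold was fitted to and decays as the samples drift, which is the gap that a headline accuracy measured on a static data set does not show.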

Interesting name. Reminds me of a security scheme, Symbiotes, I briefly evaluated on Schneier's blog. Injected security into legacy, embedded applications with various tradeoffs. Where did you get the name from?

Focusing on your last point: this is true in a lot of fields.

ML is amazingly powerful, but if you don't have sufficient domain knowledge, or you aren't collaborating very closely with actual experts, you can make very dangerous mistakes. Domain knowledge helps a lot, not just in malware, but in biology, image analysis, etc.