Great summary! By nature of my job (eng lead of a major mobile malware detection team) I have a lot of startups pitch their ML solutions to me. A couple of thoughts:

- There are no useful publicly available data sets for training. There are a few small ones and a few old ones, but they don't reflect the reality of 2016. Companies that approach me and pitch solutions to the malware of 2012 are not useful.
- The majority of mobile malware is based on some kind of social engineering. On a code level these apps are indistinguishable from legitimate applications (the same APIs are used in the same fashion). The only difference is whether app behavior meets user expectations or not. Making this decision automatically seems intractable so far.
- Malware is not really a well-defined term. There is phishing, toll fraud, Trojans, privilege escalation exploits, ... If you look for malware generically, the signals you look for will approach the complete set of APIs made available by your OS. Your results will just be a giant blob where everything is connected. Pick a single malware category and focus on just that one at a time. ML signals for priv esc will look very different from those for phishing.
- ML is sexy. Malware analysis is not. Startups seem to hire too many ML people and not enough malware analysis people. I've had startups pitch to me that had literally zero people on staff who knew what mobile malware actually looked like. They just did anomaly detection and then tossed the output over to my team to verify. That's not how it works. We're not your QA team. :)

See the Microsoft / Kaggle challenge on classifying malware families; the winning solution reached > 99% accuracy, IIRC.