by EdwardRaff 3112 days ago
Paper author here!

A lot of that issue comes from people using bad datasets. One of our first papers was about that ( http://www.readcube.com/articles/10.1007/s11416-016-0283-1?a... ), and showed that the data most people use in their research, benign files collected from clean Microsoft installs, is not sufficient. A model trained on it will literally learn to look for the string "Copyright Microsoft Corporation" to decide whether something is benign, and everything else ends up marked as malicious.
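
To make the failure mode concrete, here is a toy sketch (not the method from the paper). It assumes each file is reduced to a set of extracted strings, and every sample, string, and function name below is made up for illustration. A one-feature classifier trained on benign data that all comes from one vendor simply picks up the copyright string:

```python
from collections import Counter

# Toy stand-ins for strings extracted from binaries (entirely hypothetical data).
benign_train = [
    {"Copyright Microsoft Corporation", "kernel32.dll", "CreateFileW"},
    {"Copyright Microsoft Corporation", "user32.dll", "GetMessageW"},
    {"Copyright Microsoft Corporation", "kernel32.dll", "VirtualAlloc"},
]
malicious_train = [
    {"kernel32.dll", "VirtualAlloc", "WriteProcessMemory"},
    {"user32.dll", "CreateFileW", "SetWindowsHookExW"},
    {"kernel32.dll", "GetMessageW", "CreateRemoteThread"},
]


def best_single_feature(benign, malicious):
    """Return the string whose presence best separates benign from malicious.

    This is a one-feature decision stump, used only to illustrate the bias.
    """
    benign_counts = Counter(s for sample in benign for s in sample)
    malicious_counts = Counter(s for sample in malicious for s in sample)
    vocab = set(benign_counts) | set(malicious_counts)

    def score(s):
        # Fraction of benign samples containing s minus fraction of malicious ones.
        return benign_counts[s] / len(benign) - malicious_counts[s] / len(malicious)

    return max(vocab, key=score)


def predict(sample, feature):
    """Call a sample benign only if it contains the learned feature."""
    return "benign" if feature in sample else "malicious"


feature = best_single_feature(benign_train, malicious_train)

# A harmless program from some other vendor (hypothetical) has no such string.
third_party_benign = {"Copyright Example Vendor", "kernel32.dll", "CreateFileW"}

print(feature)
print(predict(third_party_benign, feature))
```

On this toy data the stump selects the Microsoft copyright string and then flags the third-party benign sample as malicious, which is the same shortcut described above, just at a much smaller scale.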

We are using better data in this work, and it does not suffer from this problem. It is not ready to be a real production AV, but it does a fairly good job of separating benign from malicious files and of dealing with non-trivial examples of both.