by EdwardRaff 3119 days ago
At a very high level, yes. But the same could be said for anybody in the AI-AV space.

At a more technical level, the approach we take in this paper (and most of my research) is fairly orthogonal to what most AV vendors are doing, even compared to the AI-based solutions.

The idea here was to throw away everything we know about the file being a valid Windows PE binary, and try to let the network learn what it needs on its own. It's making the problem harder, but allows us to re-purpose the same code for PDFs, Word Docs, RTF - basically any file format we can get data for. This gives us a lot of potential flexibility that others don't have.
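To make the "no format knowledge" idea concrete, here is a minimal sketch of a format-agnostic input step. It is not the paper's code: the function name `bytes_to_ids`, the `max_len` parameter, and the choice of 0 as a padding id are all assumptions for illustration. The point is only that the same function accepts a PE, a PDF, or anything else.

```python
def bytes_to_ids(data: bytes, max_len: int = 16) -> list[int]:
    """Map raw bytes to integer ids 1..256, reserving 0 for padding.

    Hypothetical helper: nothing here inspects headers or sections,
    so the file format never enters the picture.
    """
    ids = [b + 1 for b in data[:max_len]]   # truncate long inputs
    ids.extend([0] * (max_len - len(ids)))  # pad short inputs
    return ids


# The same call works on the start of a PE file or a PDF.
pe_like = b"MZ\x90\x00"
pdf_like = b"%PDF-1.7"
print(bytes_to_ids(pe_like, 8))
print(bytes_to_ids(pdf_like, 8))
```

A model consuming these ids would have to learn any structure itself, which is the trade-off described above: a harder learning problem in exchange for reuse across formats.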


> allows us to re-purpose the same code for PDFs, Word Docs, RTF

It doesn't seem like you've looked into this. The interesting data in PDFs and Office docs is all encoded, often multiple times. E.g. OOXML docs are ZIP files and store macros in an OLE container, where they're further encoded in streams.
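The layering described here can be seen with a few lines of Python. This is a sketch on a toy file built in memory, not a real document: `make_toy_docm` and `find_ole_members` are hypothetical helpers, and the fake `word/vbaProject.bin` contains only the OLE (Compound File Binary) signature followed by zero bytes.

```python
import io
import zipfile

# Signature that begins every OLE / Compound File Binary container.
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def find_ole_members(ooxml_bytes: bytes) -> list[str]:
    """Return names of ZIP members whose content starts with the OLE signature."""
    hits = []
    with zipfile.ZipFile(io.BytesIO(ooxml_bytes)) as zf:
        for name in zf.namelist():
            with zf.open(name) as member:
                if member.read(len(OLE_MAGIC)) == OLE_MAGIC:
                    hits.append(name)
    return hits


def make_toy_docm() -> bytes:
    """Build a toy stand-in for a macro-enabled document (illustration only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("word/document.xml", "<w:document/>")
        zf.writestr("word/vbaProject.bin", OLE_MAGIC + b"\x00" * 64)
    return buf.getvalue()


print(find_ole_members(make_toy_docm()))
```

Getting to the macro source in a real file would take two more steps not shown here: walking the OLE directory to find the VBA streams, then decompressing them.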

You can kind of get away with not parsing PE files, although you're missing out in that case. For PDFs, Office docs, and most other non-binary, non-script types, though, you have no choice but to parse.

We have looked into it; it's just not in this paper. It actually works better on other file formats. PDFs are really easy to do with even simpler techniques, and no parsing is needed. Modern Office docs need to be unzipped first, but that's not complicated. Old Office 97 docs are also a common vector, and they need no preprocessing.

This paper looked specifically at PE files because it's the hardest case of any of the file types (in our opinion & experience), and it's the one we have the most data for. We've built models for many other file types with success using much less data (though we are always looking for more).

I'm responding to your comment about parsing, which doesn't jibe with reality. It's not a question of data science, just of what's visible to the naïve byte-driven approach.

E.g. in PDFs, malicious JavaScript might be buried in an XFA stream with /Type /EmbeddedFile and /Filter /FlateDecode -- a zlib/DEFLATE-compressed stream, the same compression ZIP uses. Nothing about this is suspicious; benign PDFs do it too.

Without looking inside the stream, you can't know whether it's bad. The rest of the PDF is incidental and can be swapped out with no change to the attack.
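A toy illustration of that visibility gap, assuming a made-up payload and a hand-built stream object rather than a real PDF: `make_stream` and `extract_stream` are hypothetical helpers, and `zlib.compress` stands in for what /FlateDecode undoes. A real parser would also have to handle xref tables, /Length, and chained filters.

```python
import zlib

# Made-up JavaScript payload, for illustration only.
JS = b"var s = unescape('%u9090%u9090%u9090%u9090'); app.alert(s);"


def make_stream(payload: bytes) -> bytes:
    """Wrap a payload as a Flate-compressed PDF-style stream object."""
    compressed = zlib.compress(payload)
    header = (
        b"<< /Type /EmbeddedFile /Filter /FlateDecode /Length %d >>\n"
        % len(compressed)
    )
    return header + b"stream\n" + compressed + b"\nendstream"


def extract_stream(obj: bytes) -> bytes:
    """Decode the stream body (toy extraction: keyword search, no real parsing)."""
    start = obj.index(b"stream\n") + len(b"stream\n")
    end = obj.rindex(b"\nendstream")
    return zlib.decompress(obj[start:end])


raw = make_stream(JS)
print(b"app.alert" in raw)                  # what a scan of the raw bytes sees
print(b"app.alert" in extract_stream(raw))  # what is visible after decoding
```

Whether a learned model can pick up useful signal from the compressed bytes anyway is exactly the point in dispute in this thread; the sketch only shows that the literal tokens are not there to match on.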

Can your approach produce a model to detect these PDFs? Sure, by overfitting a small/homogeneous data set. Which, to be fair, is almost impossible not to do, because sourcing and curating data is the hardest part of security-related data science. But in the wild, your miss rates will skyrocket.

This will all make more sense if you ever deploy. Then you'll see issues even in your PE model, for example with installers, signed files, parasitics, generic packers, p-code, DLLs, drivers, on and on.