

by FreakLegion 3122 days ago

> allows us to re-purpose the same code for PDFs, Word Docs, RTF

It doesn't seem like you've looked into this. The interesting data in PDFs and Office docs is all encoded, often multiple times. E.g. OOXML docs are ZIP files and store macros in an OLE container, where they're further encoded in streams.

You can kind of get away with not parsing PE files, although you're missing out in that case. For PDFs, Office docs, and most other non-binary, non-script types, though, you have no choice but to parse.
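To illustrate the layering described above, here is a minimal sketch, not anyone's production parser. It builds a toy macro-enabled document in memory and looks for ZIP members that begin with the OLE (Compound File Binary) signature. The helper names `find_ole_parts` and `make_toy_docm` are hypothetical, and the toy `word/vbaProject.bin` holds only the signature plus padding, not a real VBA project. A real macro would need a further OLE parse and stream decoding on top of this.

```python
import io
import zipfile

# Signature that opens every OLE / Compound File Binary container.
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def find_ole_parts(data: bytes) -> list[str]:
    """Return names of ZIP members whose content starts with the OLE signature."""
    hits = []
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for name in zf.namelist():
            with zf.open(name) as member:
                if member.read(len(OLE_MAGIC)) == OLE_MAGIC:
                    hits.append(name)
    return hits


def make_toy_docm() -> bytes:
    """Build a toy stand-in for a .docm file (hypothetical, for illustration only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("[Content_Types].xml", "<Types/>")
        zf.writestr("word/document.xml", "<document/>")
        # Signature plus zero padding; a real vbaProject.bin is a full OLE file.
        zf.writestr("word/vbaProject.bin", OLE_MAGIC + b"\x00" * 504)
    return buf.getvalue()


toy = make_toy_docm()
# The member is only identifiable after the ZIP layer has been decoded.
print(find_ole_parts(toy))
```

Even in this toy case the OLE container only becomes recognizable after the ZIP member is decompressed, and that is just the first of the layers mentioned above.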


EdwardRaff 3122 days ago

We have looked into it; it's just not in this paper. It actually works better on other file formats. PDFs are really easy to do with even simpler techniques, and no parsing is needed. Modern Office docs need to be unzipped first, but that's not complicated. Old Office 97 docs are also a common vector, and they don't need any preprocessing.

This paper looked specifically at PE files because it's the hardest case of any of the file types (in our opinion & experience), and it's the one we have the most data for. We've built models for many other file types with success using much less data (though we are always looking for more).

FreakLegion 3121 days ago

I'm responding to your comment about parsing, which doesn't jibe with reality. It's not a question of data science, just of what's visible to the naïve byte-driven approach.

E.g. in PDFs malicious JavaScript might be buried in an XFA stream with /Type /EmbeddedFile and /Filter /FlateDecode -- deflate-compressed, in other words, the same compression ZIP uses. Nothing about this is suspicious; benign PDFs do it too.

Without looking inside the stream, you can't know whether it's bad. The rest of the PDF is incidental and can be swapped out with no change to the attack.
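A small sketch of why the stream has to be decoded, under stated assumptions: the payload is a harmless stand-in string, the PDF fragment is a toy rather than a valid document, and `inflate_streams` is a hypothetical helper that simply tries to zlib-inflate whatever follows each `stream` keyword. A real PDF parser would also handle object references, /Length, filter chains, and encryption.

```python
import re
import zlib

# Harmless stand-in for script content; repeated so it compresses well.
payload = b"app.alert('hypothetical payload');" * 4
needle = b"app.alert"

# /FlateDecode stream data is zlib-wrapped deflate.
compressed = zlib.compress(payload)
pdf_fragment = (
    b"<< /Type /EmbeddedFile /Filter /FlateDecode /Length "
    + str(len(compressed)).encode()
    + b" >>\nstream\n"
    + compressed
    + b"\nendstream\n"
)


def inflate_streams(pdf_bytes: bytes) -> list[bytes]:
    """Try to inflate the data after each 'stream' keyword (toy heuristic)."""
    results = []
    # The lookbehind skips the 'stream' inside 'endstream'.
    for match in re.finditer(rb"(?<!end)stream\r?\n", pdf_bytes):
        inflater = zlib.decompressobj()
        try:
            # decompressobj stops at the end of the zlib data and ignores the rest.
            results.append(inflater.decompress(pdf_bytes[match.end():]))
        except zlib.error:
            continue  # not zlib data; a real parser would check /Filter instead
    return results


# Search the raw bytes, then search the decoded stream contents.
print(needle in pdf_fragment)
print(any(needle in s for s in inflate_streams(pdf_fragment)))
```

The dictionary and the `stream`/`endstream` keywords are all a raw byte scan gets to see; the string being searched for only exists in the inflated output.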

Can your approach produce a model to detect these PDFs? Sure, by overfitting a small/homogeneous data set. Which, to be fair, is almost impossible not to do, because sourcing and curating data is the hardest part of security-related data science. But in the wild, your miss rates will skyrocket.

This will all make more sense if you ever deploy. Then you'll see issues even in your PE model, for example with installers, signed files, parasitics, generic packers, p-code, DLLs, drivers, on and on.