by EdwardRaff
3121 days ago
We have looked into it; it's just not in this paper. It actually works better on other file formats. PDFs are really easy to handle with even simpler techniques, and no parsing is needed. Modern Office documents need to be unzipped first, but that's not complicated. Old Office 97 documents are also a common vector, and they don't need any preprocessing. This paper looked specifically at PE files because that's the hardest case of any of the file types (in our opinion and experience), and it's the one we have the most data for. We've built models for many other file types with success using much less data (though we are always looking for more).
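As a rough illustration of the "unzipped first" step for modern Office documents, here is a minimal Python sketch using only the standard library. The helper name `ooxml_member_bytes`, the synthetic archive, and the idea of concatenating the members into a single byte string are assumptions made for illustration; the comment does not say how the unzipped bytes are actually fed to a model.

```python
import io
import zipfile


def ooxml_member_bytes(doc_bytes):
    """Return {member_name: decompressed_bytes} for a ZIP-based Office file.

    Hypothetical helper for illustration only.
    """
    with zipfile.ZipFile(io.BytesIO(doc_bytes)) as archive:
        return {name: archive.read(name) for name in archive.namelist()}


# Build a tiny synthetic archive in memory as a stand-in for a real .docx.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("word/document.xml", "<w:document>hello</w:document>")
    archive.writestr("word/vbaProject.bin", b"\x00fake macro bytes")

members = ooxml_member_bytes(buffer.getvalue())

# Assumption: join the members in a stable order to get one byte sequence.
model_input = b"".join(members[name] for name in sorted(members))
```

The point is only that the decompressed members, not the raw ZIP container, are what a byte-level model would need to see.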
E.g. in PDFs, malicious JavaScript might be buried in an XFA stream with /Type /EmbeddedFile and /Filter /FlateDecode, i.e. deflate-compressed, the same compression ZIP uses. Nothing about this is suspicious; benign PDFs do it too.
Without looking inside the stream, you can't know whether it's bad. The rest of the PDF is incidental and can be swapped out with no change to the attack.
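To make "looking inside the stream" concrete, here is a minimal Python sketch that inflates /FlateDecode streams before searching them. It is not a real PDF parser: the regex-based `inflate_flate_streams` helper and the synthetic sample object are assumptions for illustration, and real files add indirect /Length values, filter chains, object streams, and encryption that this ignores.

```python
import re
import zlib

# Naive pattern: a dictionary, then the bytes between "stream" and "endstream".
STREAM_RE = re.compile(
    rb"<<(.*?)>>\s*stream\r?\n(.*?)\r?\nendstream", re.DOTALL
)


def inflate_flate_streams(pdf_bytes):
    """Yield (dictionary, inflated_bytes) for each /FlateDecode stream.

    Hypothetical helper for illustration only.
    """
    for match in STREAM_RE.finditer(pdf_bytes):
        dictionary, body = match.group(1), match.group(2)
        if b"/FlateDecode" not in dictionary:
            continue
        try:
            # decompressobj tolerates trailing bytes after the zlib stream.
            yield dictionary, zlib.decompressobj().decompress(body)
        except zlib.error:
            # Malformed stream; a real scanner would handle this, we skip it.
            continue


# Synthetic object standing in for an embedded, compressed script.
payload = b"app.alert('hello');"
sample = (
    b"1 0 obj\n<< /Type /EmbeddedFile /Filter /FlateDecode >>\nstream\n"
    + zlib.compress(payload)
    + b"\nendstream\nendobj\n"
)

raw_hit = b"app.alert" in sample  # search over the raw file bytes
inflated = [data for _, data in inflate_flate_streams(sample)]
inflated_hit = any(b"app.alert" in data for data in inflated)
```

In this toy case the script text does not appear in the raw bytes and only shows up after inflation, which is the gap the comment describes.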
Can your approach produce a model to detect these PDFs? Sure, by overfitting a small/homogeneous data set. Which, to be fair, is almost impossible not to do, because sourcing and curating data is the hardest part of security-related data science. But in the wild, your miss rates will skyrocket.
This will all make more sense if you ever deploy. Then you'll see issues even in your PE model, for example with installers, signed files, parasitics, generic packers, p-code, DLLs, drivers, on and on.