
I'm responding to your comment about parsing, which doesn't jibe with reality. It's not a question of data science, just of what's visible to the naïve byte-driven approach.

E.g. in PDFs, malicious JavaScript might be buried in an XFA stream with /Type /EmbeddedFile and /Filter /FlateDecode -- zlib/deflate-compressed data, in other words, the same compression ZIP typically uses. Nothing about this is suspicious; benign PDFs do it too.

Without decompressing the stream and looking inside, you can't know whether it's bad. The rest of the PDF is incidental and can be swapped out with no change to the attack.
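To make that concrete, here's a rough sketch of what a byte-level view sees versus what's actually in the stream. Everything in it is made up for illustration: the payload string, the hand-built object (which is not a valid PDF on its own), and the `extract_stream` helper are all assumptions, not anything from a real sample or a real parser.

```python
import zlib

# Hypothetical stand-in for script content; not taken from any real sample.
payload = b"app.alert('hello from the stream');\n" * 20
compressed = zlib.compress(payload)  # zlib/deflate, which is what /FlateDecode expects

# Hand-built, PDF-like object for illustration only (not a complete PDF).
pdf_object = (
    b"1 0 obj\n<< /Type /EmbeddedFile /Filter /FlateDecode /Length "
    + str(len(compressed)).encode()
    + b" >>\nstream\n"
    + compressed
    + b"\nendstream\nendobj\n"
)

def extract_stream(obj: bytes) -> bytes:
    """Hypothetical helper: return the bytes between 'stream' and 'endstream'."""
    start = obj.index(b"stream\n") + len(b"stream\n")
    end = obj.rindex(b"\nendstream")
    return obj[start:end]

needle = b"app.alert"

# What a byte-driven model gets to look at: the compressed bytes.
visible_in_raw = needle in pdf_object

# What a parser that applies the filter gets to look at.
inflated = zlib.decompress(extract_stream(pdf_object))
visible_after_inflate = needle in inflated

print("needle in raw bytes:     ", visible_in_raw)
print("needle after inflating:  ", visible_after_inflate)
```

The dictionary keys around the stream are identical whether the payload is benign or hostile, so the only distinguishing signal sits behind the decompression step.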

Can your approach produce a model to detect these PDFs? Sure, by overfitting a small/homogeneous data set. Which, to be fair, is almost impossible not to do, because sourcing and curating data is the hardest part of security-related data science. But in the wild, your miss rates will skyrocket.

This will all make more sense if you ever deploy. Then you'll see issues even in your PE model, for example with installers, signed files, parasitics, generic packers, p-code, DLLs, drivers, on and on.