by FreakLegion
3122 days ago
> allows us to re-purpose the same code for PDFs, Word Docs, RTF

It doesn't seem like you've looked into this. The interesting data in PDFs and Office docs is all encoded, often multiple times. E.g. OOXML docs are ZIP files and store macros in an OLE container, where they're further encoded in streams. You can kind of get away with not parsing PE files, although you're missing out in that case. For PDFs, Office docs, and most other non-binary, non-script types, though, you have no choice but to parse.
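To make the layering concrete, here is a rough Python sketch of just the first step: opening an OOXML-style ZIP and spotting members that are OLE containers by their signature. The helper names (`find_ole_parts`, `make_sample_docm`) and the synthetic sample are made up for illustration; a real macro-enabled document has more structure than this, and the sample's `vbaProject.bin` is only the 8-byte OLE signature followed by padding.

```python
import io
import zipfile

# Signature at the start of an OLE compound file.
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def find_ole_parts(data):
    """Return names of ZIP members whose content starts with the OLE signature."""
    if not zipfile.is_zipfile(io.BytesIO(data)):
        return []
    found = []
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for info in zf.infolist():
            # zf.open() decompresses the member; the raw file bytes would not do.
            with zf.open(info) as member:
                if member.read(len(OLE_MAGIC)) == OLE_MAGIC:
                    found.append(info.filename)
    return found


def make_sample_docm():
    """Build a tiny synthetic ZIP that mimics the layout (hypothetical sample)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("[Content_Types].xml", "<Types/>")
        zf.writestr("word/document.xml", "<document/>")
        # Placeholder payload: OLE signature plus padding, not a valid container.
        zf.writestr("word/vbaProject.bin", OLE_MAGIC + b"\x00" * 64)
    return buf.getvalue()


if __name__ == "__main__":
    print(find_ole_parts(make_sample_docm()))
```

This only peels off the outer ZIP layer. The macro source inside the OLE container is still stored in encoded streams, so getting at it means parsing the container as well, for example with a dedicated library such as olefile.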
This paper looked specifically at PE files because they are the hardest case of any of the file types (in our opinion & experience), and the one we have the most data for. We've successfully built models for many other file types using much less data (though we are always looking for more).