by notacop31337
1303 days ago
My understanding is that this is largely because you're fighting an adversarial format provider in Adobe. I've read a few papers and journal entries on file format polyglotting, some focused on PDF, and the approaches keep shifting because Adobe keeps closing off the pathways that work. I think that's partly for security and, IMO, partly for obscurity, as PDF is a horrific format for anything except human visual interpretation. Many organisations and industries ONLY publish public data as PDF, which makes parsing that information a far more difficult task (again, I suspect by design). After trying a few parsing options for PDF, I've come to hate it as a format. Luckily, some options from a few cloud providers seem to be really hammering the problem's complexity down, but the cost of those solutions can be very steep.

People who think of the format as "adversarial" are wrong. Adobe never gave a shit about being adversarial in that sense.
The problem is that PDF is not a file format; it's a defined subset of a programming language (PostScript) used for portable rendering with fidelity. It's portable in the sense that it should render the same way on whatever device it's rendered on (printed on a page or rasterized to a display). It's also portable because it doesn't allow any PostScript job-level commands, and it tries to ensure that each PDF file is standalone and can be concatenated with others into a multi-page document or embedded in another document.
PostScript (and PDF) also use postfix notation, where the operands come before the operator, which can be confusing.
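
To make the postfix point concrete, here's a toy stack evaluator in Python. It's only a sketch of the idea, not a PostScript or PDF interpreter: the function name `eval_postfix` is made up for illustration, and it handles just a handful of operators whose names are borrowed from PostScript (`add`, `sub`, `mul`, `div`, `dup`, `exch`). PDF content streams follow the same operands-first pattern with drawing operators, e.g. `100 200 m` to move to a point.

```python
def eval_postfix(source):
    """Evaluate a whitespace-separated postfix program; return the final stack.

    Hypothetical helper for illustration only. Numbers are pushed onto a
    stack; operators pop their operands off the stack and push the result.
    """
    stack = []
    binary_ops = {
        "add": lambda a, b: a + b,
        "sub": lambda a, b: a - b,
        "mul": lambda a, b: a * b,
        "div": lambda a, b: a / b,
    }
    for token in source.split():
        if token in binary_ops:
            b = stack.pop()  # top of stack is the second operand
            a = stack.pop()
            stack.append(binary_ops[token](a, b))
        elif token == "dup":
            stack.append(stack[-1])  # duplicate the top element
        elif token == "exch":
            stack[-1], stack[-2] = stack[-2], stack[-1]  # swap the top two
        else:
            stack.append(float(token))  # anything else is treated as a number
    return stack


# Infix (3 + 4) * 2 is written operands-first:
print(eval_postfix("3 4 add 2 mul"))  # prints [14.0]
```

The part that trips people up is that you can't tell what a number is for until you reach the operator that consumes it, so you have to read such a program with the stack in your head.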