Hacker News new | ask | show | jobs
by layer8 167 days ago
These PDFs apparently used the “incremental update” feature of PDF, where edits to the document are merely appended to the original file.

It’s easy to extract the earlier versions, for example with a plain text editor. Just search for lines starting with “%%EOF”, and truncate the file after that line. Voila, the resulting file is the respective earlier PDF version.

(One exception is the first %%EOF in a so-called linearized PDF, which marks a pseudo-revision that is only there for technical reasons and isn’t a valid PDF file by itself.)

3 comments

It's hilarious the extent to which Adobe Systems's ridiculously futile attempt to chase MS Word features ended up being the single most productive espionage tool of the last quarter century.
I don’t think this was particularly modeled on MS Word. The incremental update feature was introduced with PDF 1.2 in 1996. It allows to quickly save changes without having to rewrite the whole file, for example when annotating a PDF.

Incremental updates are also essential for PDF signatures, since when you add a subsequent signature to a PDF, you couldn’t rewrite the file without breaking previous signatures. Hence signatures are appended as incremental updates.

PDF files are for storing fixed (!!) output of printed/printable material. That's where the format's roots are via Postscript, it's where the format found its main success in document storage, and it's the metaphor everyone has in mind when using the format.

PDFs don't change. PDFs are what they look like.

Except they aren't, because Adobe wanted to be able to (ahem) "annotate" them, or "save changes" to them. And Adobe wanted this because they wanted to sell Acrobat to people who would otherwise be using MS Word for these purposes.

And in so doing, Adobe broke the fundamental design paradigm of the format. And that has had (and continues to have, to hilarious effect) continuing security impact for the data that gets stored in this terrible format.

When Acrobat came out cross platform was not common. Being able to publish a document that could be opened on multiple platforms was a big advantage. I was using it to distribute technical specifications in the mid 90's. Different pages of these specifications came from, Filemaker, Excel, Word, Mini-Cad, Photoshop, Illustrator, and probably other applications as well. We would combine these into a single PDF file. This simplified version control. This also meant that bidders could not edit the specifications.

None of that could be accomplished with Word alone. I think you are underestimating the qualities of PDF for distribution of complex documents.

> This also meant that bidders could not edit the specifications.

But they can! That's the bug, PDF is a mutable file format owing to Adobe's muckery. And you made the same mistake that every government redactor and censor (up to and including the ?!@$! NSA per the linked article) has in the intervening decades.

The file format you thought you were using was a great fit for your problem, and better than MS Word. The software Adobe shipped was, in fact, something else.

Our use wasn't that high stakes. I also can't remember all the details but were using 3rd party publishing tools beyond Acrobat. Acrobat was the reader that our specifications could be opened in. We had to use additional software and to add consistent headers and get accurate pager numbers.
It started in the '80s. PostScript was the big deal. It was a printer language, not a document language. It was not limited to “(mostly) text documents”, even though complex vector fonts and even hinting were introduced. For example, you could print some high quality vector graphs in native printer resolution from systems which would never ever get enough memory to rasterise such giant bitmaps, by writing/exporting to PostScript. That's where Adobe's business was. See also NeWS and NeXT.

However, arbitrary non-trivial PostScript files were of little use to people without a hardware or software rasteriser (and sometimes fonts matching the ones the author had, and sometimes the specific brand of RIP matching the quirks of authoring software, etc.), so it was generally used by people in publishing or near it. PDF was an attempt to make a document distribution format which was more suitable to more common people and more common hardware (remember the non-workstation screen resolutions at the time). I doubt that anyone imagined typical home users writing letters and bulletins in Acrobat, of all things (though it does happen). It would be similar to buying Photoshop to resize images (and waiting for it to load each time). Therefore, competitor to Word it was not. Vice versa, Word file was never considered a format suitable for printing. The more complex the layout and embedded objects, the less likely it would render properly on publisher's system (if Microsoft Office did exist for its architecture at all). Moreover, it lacked some features which were essential for even small scale book publishing.

Append-only or versioned-indexed chunk-based file formats for things we consider trivial plain data today were common at the time. Files could be too big to rewrite completely each time even without edits, just because of disk throughput and size limits. The system could not be able to load all of the data into memory because of addressing or size limitations (especially when we talk about illustrations in resolutions suitable for printing). Just like modern games only load the objects in player's vicinity instead of copying all of the dozens or hundreds of gigabytes into memory, document viewers had to load objects only in the area visible on screen. Change the page or zoom level, and wait until everything reloads from disk once again. Web browsers, for example, handle web pages of any length in the same fashion. I should also remind you that default editing mode in Word itself in the '90s was not set to WYSIWYG for similar performance reasons. If you look at the PDF object tree, you can see that some properties are set on the level above the data object, and that allows overwriting the small part of the index with the next version to change, say, position without ever touching the chunk in which the big data itself stays (because appending the new version of that chunk, while possible, would increase the file size much more).

Document redraw speed can be seen in this random video. But that's 1999, and they probably got a really well performing system to record the promotional content. https://www.youtube.com/watch?v=Pv6fZnQ_ExU

PDF is a terrible format not because of that, but because its “standard” retroactively defined everything from the point of view of Acrobat developer, and skipped all the corner cases and ramifications (because if you are an Acrobat developer, you define what is a corner case, and what is not). As a consequence, unless you are in a closed environment you control, the only practical validator for arbitrary PDFs is Acrobat (I don't think that happened by chance). The external client is always going to say “But it looks just fine on my screen”.

I'm pretty sure you can change various file formats without rewriting the entire file and without using "incremental updates".
You can’t insert data into the middle of a file (or remove portions from the middle of a file) without either rewriting it completely, or at least rewriting everything after the insertion point; the latter requires holding everything after the insertion point in memory (or writing it out to another file first, then reading it in and writing it out again).

PDF is designed to not require holding the complete file in memory. (PDF viewers can display PDFs larger than available memory, as long as the currently displayed page and associated metadata fits in memory. Similar for editing.)

While tedious, you can do the rewrite block-wise from the insertion point and only store a an additional block's worth of the rest (or twice as much as you inserted)

ABCDE, to insert 1 after C: store D, overwrite D with 1, store E, overwrite E with D, write E.

No, if you are going to change the structure of a structured document that has been saved to disk, your options are:

1) Rewrite the file to disk 2) Append the new data/metadata to the end of the existing file

I suppose you could pre-pad documents with empty blocks and then go modify those in situ by binary editing the file, but that sounds like a nightmare.

Aren't there file systems that support data structures which allow editing just part of the data, like linked lists?
Yeah there are, Linux supports parameters FALLOC_FL_INSERT_RANGE and FALLOC_FL_COLLAPSE_RANGE for fallocate(2). Like most fancy filesystem features, they are not used by the vast majority of software because it has to run on any filesystem so you'd always need to maintain two implementations (and extensive test cases).
Look at the C file API which most software is based on, it simply doesn’t allow it. Writing at a given file position just overwrites existing content. There is no way to insert or remove bytes in the middle.

Apart from that, file systems manage storage in larger fixed-size blocks (commonly 4 KB). One block typically links to the next block (if any) of the same file, but that’s about the extent of it.

No. Well yes. On mainframes.

This is why “table of contents at the end” is such an exceedingly common design choice.

This was 1996. A typical computer had tens of megabytes of memory with throughput a fraction of what we have today. Appending an element instead of reading, parsing, inserting and validating the entire document is a better solution in so many ways. That people doing redactions don't understand the technology is a separate problem. The context matters.
New OSINT skill unlocked
I see an interesting parallel to how people think about captured encrypted data, and how long that encryption needs to be effective for until technology catches up and can decrypt (by which point, hopefully the decrypted data is worthless). If all of these documents are stored in durable archives, future methodologies may arrive to extract value or intelligence not originally available at the time of capture and disclosure.
> If all of these documents are stored in durable archives, future methodologies may arrive to extract value or intelligence not originally available at the time of capture and disclosure.

I recently learned that some people improve or brush up on their OSINT skills by trying to find missing people!

Microsoft Word once had a "Fast Save" feature which did this. It's hard to find much information about it these days. Supposedly it was removed in Office 2003 SP3: https://www.betaarchive.com/wiki/index.php?title=Microsoft_K....