|
I'm thankful PDF won, because otherwise I think it would have been Microsoft Word. There was a time when papers, books, resumes, contracts, etc. almost always came as Word. Does anyone else remember getting a book as preface.doc, chap1.doc, chap1a.doc, chap2.doc, subchap2a2.doc, and so on, plus a mess of JPEGs and GIFs, and trying to figure out how it all had to be assembled, only to discover that something was missing or that one chapter was newer than the others? That's one reason I really like PDF -- it's one file, self-contained, and linear. On the other hand, I really wish it were more diff'able. If, for example, a credit card company changes one word in their terms & conditions PDF, it seems like 90% of the document changes at the binary level. I know that PDF diff tools exist, but there must be tremendous internal complexity in the PDF format for tiny changes to alter the whole structure. |
Apart from compression and similar stream encodings, PDF files are essentially text files, with a drawing model largely based on PostScript but without the programming. If you want to diff them, use `mutool clean -d -a` to first turn them into pure ASCII text.
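For what it's worth, here is a rough Python sketch of that workflow: run both PDFs through `mutool clean -d -a`, then feed the results to a plain line diff. It assumes `mutool` is on your PATH; the function names (`clean_command`, `normalized_lines`, `diff_lines`) are made up for illustration, and reading the output as Latin-1 is just my way of making sure any stray bytes decode.

```python
import difflib
import subprocess
import tempfile
from pathlib import Path


def clean_command(src: str, dst: str) -> list[str]:
    # -d decompresses streams, -a ASCII-hex-encodes binary streams,
    # so the rewritten file is plain text that a line diff can handle.
    return ["mutool", "clean", "-d", "-a", src, dst]


def normalized_lines(pdf_path: str) -> list[str]:
    # Assumption: mutool is installed and on PATH.
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "clean.pdf"
        subprocess.run(clean_command(pdf_path, str(out)), check=True)
        # Latin-1 maps every byte to a character, so decoding cannot fail.
        return out.read_text(encoding="latin-1").splitlines()


def diff_lines(old: list[str], new: list[str]) -> list[str]:
    # Ordinary unified diff over the normalized text.
    return list(
        difflib.unified_diff(old, new, fromfile="old.pdf", tofile="new.pdf", lineterm="")
    )


if __name__ == "__main__":
    import sys

    old_pdf, new_pdf = sys.argv[1], sys.argv[2]
    print("\n".join(diff_lines(normalized_lines(old_pdf), normalized_lines(new_pdf))))
```

Expect the diff to be noisy anyway: object numbers, offsets, and coordinates can all change even when the visible edit is tiny.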
That said, since it's a "baked" layout format, if one word pushes the rest of the text forward, everything after that will show up with changed coordinates. It's closer to a vector image format like SVG than a markup format like HTML or ODF.
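To make the "baked layout" point concrete, here is a toy sketch in Python. It is not how any real PDF producer lays out text: the fixed character width, the greedy line breaking, and the names `layout` and `moved` are all simplifying assumptions. It only shows that once positions are stored as absolute coordinates, changing one word changes the coordinates of everything after it.

```python
def layout(words, char_width=6.0, space=3.0, line_width=120.0, leading=14.0):
    """Place words greedily and return a list of (word, x, y) tuples."""
    x, y = 0.0, 0.0
    placed = []
    for word in words:
        width = len(word) * char_width
        # Wrap to the next line if the word would overflow the current one.
        if x > 0 and x + width > line_width:
            x = 0.0
            y += leading
        placed.append((word, x, y))
        x += width + space
    return placed


def moved(old, new):
    """Count positions (compared by index) whose coordinates differ."""
    return sum(1 for a, b in zip(old, new) if a[1:] != b[1:])


before = layout("the fee is ten dollars per month".split())
after = layout("the fee is fourteen dollars per month".split())

for (w1, x1, y1), (w2, x2, y2) in zip(before, after):
    marker = "" if (x1, y1) == (x2, y2) else "  <- moved"
    print(f"{w1:>8} ({x1:5.1f},{y1:5.1f})   {w2:>8} ({x2:5.1f},{y2:5.1f}){marker}")
```

Only one word was edited, but every word after it gets new coordinates, and a text diff of the page content would flag all of those lines.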
There are also things like font subsetting, where removing a word that was the only use of a character, or adding a word that uses a new character, might change the font data to add/remove those characters.
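A tiny Python illustration of the subsetting effect, with the caveat that real subsetters work on glyphs rather than characters (ligatures, composites, and so on) and that `subset_chars` and `subset_delta` are hypothetical helpers, not part of any PDF library:

```python
def subset_chars(text: str) -> set[str]:
    # Simplification: treat each non-whitespace character as one glyph.
    return {ch for ch in text if not ch.isspace()}


def subset_delta(old_text: str, new_text: str) -> tuple[list[str], list[str]]:
    """Return (characters added to the subset, characters dropped from it)."""
    old, new = subset_chars(old_text), subset_chars(new_text)
    return sorted(new - old), sorted(old - new)


added, removed = subset_delta("pay the fee", "pay the tax")
print("added:", added, "removed:", removed)
```

If either list is non-empty, the embedded font program has to be regenerated, which shows up as a large binary change even though the edit was one word.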