by marcan_42 2064 days ago
The stream objects inside a PDF file are usually compressed. That means if anything in a stream changes, the whole compressed binary blob for it can change.
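To make that concrete, here's a rough sketch using Python's `zlib` (Flate is the usual PDF stream filter). The content stream below is made up for illustration, not taken from a real PDF; the point is only that a one-word edit early in the input leaves the uncompressed text almost identical while the compressed bytes stop matching early on.

```python
import zlib

# Hypothetical content stream: 200 text-drawing operations, one per line.
lines = [f"BT /F1 12 Tf 72 {720 - i} Td (line {i}) Tj ET" for i in range(200)]
original = "\n".join(lines).encode("ascii")

# Change a single word near the start of the stream.
edited = original.replace(b"(line 0)", b"(first line)", 1)

packed_a = zlib.compress(original)
packed_b = zlib.compress(edited)


def common_prefix_len(a: bytes, b: bytes) -> int:
    """Number of leading bytes that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# The uncompressed streams still share their entire tail...
print("same uncompressed tail:", original[-100:] == edited[-100:])
# ...but the compressed blobs stop matching shortly after the header.
print("compressed bytes shared:", common_prefix_len(packed_a, packed_b),
      "of", len(packed_a))
```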

Apart from compression and similar encodings, PDF files are actually text files, with a drawing model largely based on PostScript but without the programming. If you want to diff them, use `mutool clean -d -a` to first turn them into pure ASCII text (`-d` decompresses the streams, `-a` hex-encodes any binary ones).
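If you want to script that, something along these lines should work. It's a sketch that assumes `mutool` is on your PATH; the helper names (`mutool_clean_cmd`, `diff_text`, `diff_pdfs`) and the output file naming are made up here, and `difflib` just stands in for whatever diff tool you prefer.

```python
import difflib
import subprocess
from pathlib import Path


def mutool_clean_cmd(src, dst):
    """Build the command: -d decompresses streams, -a ASCII-hex-encodes binary ones."""
    return ["mutool", "clean", "-d", "-a", str(src), str(dst)]


def diff_text(old, new, old_name="old", new_name="new"):
    """Unified diff of two strings, returned as a single string."""
    return "".join(difflib.unified_diff(
        old.splitlines(keepends=True),
        new.splitlines(keepends=True),
        old_name, new_name,
    ))


def diff_pdfs(old_pdf, new_pdf, workdir="."):
    """Clean both PDFs with mutool, then diff the resulting text."""
    texts = []
    for i, src in enumerate((old_pdf, new_pdf)):
        dst = Path(workdir) / f"clean_{i}.pdf"  # hypothetical output name
        subprocess.run(mutool_clean_cmd(src, dst), check=True)
        # latin-1 maps every byte to a character, so decoding never fails.
        texts.append(dst.read_text(encoding="latin-1"))
    return diff_text(texts[0], texts[1], str(old_pdf), str(new_pdf))
```

Calling `print(diff_pdfs("a.pdf", "b.pdf"))` would then show the object-level changes, subject to the caveats about layout and fonts.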

That said, since it's a "baked" layout format, if one word pushes the rest of the text forward, everything after that will show up with changed coordinates. It's closer to a vector image format like SVG than a markup format like HTML or ODF.
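A toy example of that effect, assuming a made-up fixed-width layout (the glyph width, line width and line height are arbitrary, and real PDF producers are far more involved): inserting one word changes the baked position of every word after it.

```python
CHAR_WIDTH = 6     # hypothetical fixed glyph advance, in points
LINE_WIDTH = 120   # hypothetical usable line width, in points
LINE_HEIGHT = 14   # hypothetical distance between baselines


def layout(words):
    """Place words left to right, wrapping when a word would overflow the line.

    Returns a list of (word, x, y) tuples, i.e. the "baked" positions.
    """
    placed = []
    x, y = 0, 0
    for word in words:
        width = len(word) * CHAR_WIDTH
        if x > 0 and x + width > LINE_WIDTH:
            x, y = 0, y + LINE_HEIGHT
        placed.append((word, x, y))
        x += width + CHAR_WIDTH  # advance past the word plus one space
    return placed


before = layout("the quick brown fox jumps over the lazy dog".split())
after = layout("the very quick brown fox jumps over the lazy dog".split())

# Pair each word following the insertion point with its old position.
moved = [
    word
    for (word, x0, y0), (_, x1, y1) in zip(before[1:], after[2:])
    if (x0, y0) != (x1, y1)
]
print(f"{len(moved)} of {len(before) - 1} following words moved")
```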

There are also things like font subsetting, where removing a word that was the only use of a character, or adding a word that uses a new character, might change the font data to add/remove those characters.
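Very roughly, the subset depends on the set of characters the document uses. This sketch treats a subset as just a set of characters (a simplification: real subsetting works on glyphs and rewrites font tables, and `subset_chars` is a name made up here), but it shows how dropping one word can change what gets embedded.

```python
def subset_chars(text):
    """Characters a subset font would need glyphs for (whitespace ignored)."""
    return {ch for ch in text if not ch.isspace()}


with_word = subset_chars("a lazy dog")
without_word = subset_chars("a dog")

# Characters used only by the removed word; their glyphs could be dropped.
removed = with_word - without_word
print(sorted(removed))

# Going the other way, a new word can pull new glyphs into the font.
added = subset_chars("a dog quiz") - without_word
print(sorted(added))
```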