Hacker News
Show HN: Why Two Identical PDFs Have Different SHA-256 Hashes (How We Fixed It) (docs.pdfcanon.com)
1 point by napzoom 44 days ago
3 comments

It bugs me when editors don't keep track of whether a document has been modified. If I hit save without having changed anything and the editor still writes a new timestamp, that timestamp is no longer accurate, because the document wasn't edited at that time.
I'm the author. The post covers the seven sources of non-determinism in the spec. The one that surprised me most in practice: about half the tools I tested regenerate the /ID array on every save, even when the spec says it should be stable. Happy to dig into any of the pipeline stages or the qpdf invocation sequence if useful.
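To make the /ID point concrete, here is a minimal sketch of the effect. It is not the pipeline from the post (which uses qpdf); it only shows why a regenerated /ID changes the SHA-256 of an otherwise unchanged file. The helper names (`raw_hash`, `hash_ignoring_id`, `fake_pdf`) are made up for illustration, and it assumes the /ID sits in a plain, uncompressed trailer as two hex strings, which is not true of every PDF.

```python
import hashlib
import re

# Matches a trailer-style /ID entry holding two hex strings,
# e.g. /ID [<ab12> <cd34>].
# Assumption: plain trailer only; cross-reference streams and
# literal-string IDs are not handled by this sketch.
ID_ENTRY = re.compile(rb"/ID\s*\[\s*<[0-9A-Fa-f]*>\s*<[0-9A-Fa-f]*>\s*\]")


def raw_hash(pdf_bytes: bytes) -> str:
    """SHA-256 over the exact bytes on disk."""
    return hashlib.sha256(pdf_bytes).hexdigest()


def hash_ignoring_id(pdf_bytes: bytes) -> str:
    """SHA-256 after replacing the /ID entry with a fixed placeholder."""
    masked = ID_ENTRY.sub(b"/ID [<00> <00>]", pdf_bytes)
    return hashlib.sha256(masked).hexdigest()


def fake_pdf(file_id: str) -> bytes:
    """Build a toy PDF-like byte string (hypothetical, not a valid PDF)."""
    fid = file_id.encode("ascii")
    return (
        b"%PDF-1.7\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
        b"trailer\n<< /Root 1 0 R /ID [<" + fid + b"> <" + fid + b">] >>\n"
        b"%%EOF\n"
    )


# Two "saves" of the same content where only the /ID was regenerated.
a = fake_pdf("aa11")
b = fake_pdf("bb22")

print(raw_hash(a) == raw_hash(b))                  # False
print(hash_ignoring_id(a) == hash_ignoring_id(b))  # True
```

Masking with a regex is only good enough for a demo. Real files can carry the /ID inside a compressed cross-reference stream, so a real normalizer has to parse and rewrite the file rather than pattern-match its bytes.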
Spoiler: They're not identical.
That's exactly the point - from the spec's perspective they aren't identical, but from every user's perspective they are. Same pages, same text, same rendering. The interesting question is whose definition of "identical" should govern when you're building audit trails or deduplication systems.