by Deregibus
3405 days ago
This was a good explanation of what's happening here, from a previous thread: https://news.ycombinator.com/item?id=13715761

The key is that essentially all of the data for both images is in both PDFs, so the PDFs are almost identical except for a ~128 byte block that "selects" the image and provides the bytes needed to cause a collision. Here's a diff of the 2 PDFs from when I tried it earlier: https://imgur.com/a/8O58Q

That's not to say there isn't still something exploitable here, but I don't think it means you can just create collisions from arbitrary PDFs.

Edit: Here's a diff of shattered-1.pdf released by Google vs. one of the PDFs from this tool. The first ~550 bytes are identical. https://imgur.com/a/vVrrQ
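For anyone who wants to reproduce this kind of diff without an image viewer, here is a rough sketch in Python. It only assumes two equal-length byte strings; the helper names `differing_ranges` and `sha1_hex` are made up for illustration, and the sample data below is synthetic, not taken from the real PDFs.

```python
import hashlib


def differing_ranges(a: bytes, b: bytes):
    """Return half-open (start, end) byte ranges where a and b differ.

    Hypothetical helper: assumes both inputs have the same length.
    """
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    ranges = []
    start = None
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            if start is None:
                start = i  # a new run of differing bytes begins here
        elif start is not None:
            ranges.append((start, i))  # the run ended just before i
            start = None
    if start is not None:
        ranges.append((start, len(a)))  # the run extends to the end
    return ranges


def sha1_hex(data: bytes) -> str:
    """SHA-1 digest as hex, for checking whether two files collide."""
    return hashlib.sha1(data).hexdigest()


# Synthetic stand-ins: identical prefix and suffix, one 128-byte block differs.
file_a = b"A" * 200 + b"\x01" * 128 + b"B" * 100
file_b = b"A" * 200 + b"\x02" * 128 + b"B" * 100
print(differing_ranges(file_a, file_b))
```

With real files you would read each with `open(path, "rb").read()` and compare `sha1_hex` of both. In actual colliding files the differing bytes are likely to be scattered within the block rather than one contiguous run, so expect several short ranges.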
This shows the importance of techniques like canonicalization and determinism, which ensure that a given set of information has exactly one valid encoding, so a particular result could only have been arrived at from exactly one input. For general-purpose programming languages like PostScript, from which PDF descends, this is essentially an unfulfillable requirement, since any number of different inputs ("source code") can produce observationally identical results. Constrained formats, and formats where the set of "essential information" can be canonicalized into a particular representation, should be the norm rather than the exotic exception, especially in situations where minute inessential differences can cascade and drastically alter the result.
[1] https://news.ycombinator.com/item?id=13715761 [2] https://news.ycombinator.com/item?id=13718772
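To make the canonicalization point concrete, here is a minimal sketch in Python. It uses JSON as a stand-in for a constrained format (it is not PDF), and the function names `canonical_bytes` and `content_digest` are hypothetical. The idea is to hash only a canonical encoding of the essential information, so inessential differences like key order or whitespace cannot change the digest.

```python
import hashlib
import json


def canonical_bytes(obj) -> bytes:
    """Encode obj in one fixed form: sorted keys, no optional whitespace.

    Assumption: obj contains only JSON-representable values. This is a
    simplified scheme for illustration, not a full canonical JSON spec
    (number formatting, for example, is left to json.dumps).
    """
    text = json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=True)
    return text.encode("utf-8")


def content_digest(obj) -> str:
    """Hash the canonical encoding rather than whatever bytes were received."""
    return hashlib.sha256(canonical_bytes(obj)).hexdigest()


# Two textually different inputs that carry the same essential information.
raw_1 = '{"title": "report", "pages": 3}'
raw_2 = '{ "pages":3,   "title":"report" }'

doc_1 = json.loads(raw_1)
doc_2 = json.loads(raw_2)

print(raw_1 == raw_2)                                   # the raw bytes differ
print(content_digest(doc_1) == content_digest(doc_2))   # the digests agree
```

This does not make a weak hash strong. What it does is shrink the room an attacker has to hide collision-enabling bytes in parts of the file that do not affect what the reader sees.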