
by Deregibus 3405 days ago

This was a good explanation of what's happening here from a previous thread: https://news.ycombinator.com/item?id=13715761

The key is that essentially all of the data for both images are in both PDFs, so the PDFs are almost identical except for a ~128 byte block that "selects" the image and provides the necessary bytes to cause a collision.

Here's a diff of the 2 PDFs from when I tried it earlier: https://imgur.com/a/8O58Q

Not to say that there isn't still something exploitable here, but I don't think it means that you can just create collisions from arbitrary PDFs.

edit: Here's a diff of shattered-1.pdf released by Google vs. one of the PDFs from this tool. The first ~550 bytes are identical.

https://imgur.com/a/vVrrQ
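For anyone who wants to reproduce the byte-level comparison described above without an image of a diff, here is a minimal sketch. It reports the byte ranges where two files differ and checks whether their SHA-1 and SHA-256 digests match. The helper names and the file paths in the usage comment are placeholders, not something taken from the tool or the thread.

```python
import hashlib
from pathlib import Path


def differing_ranges(a: bytes, b: bytes) -> list[tuple[int, int]]:
    """Return half-open (start, end) byte ranges where a and b differ."""
    ranges = []
    start = None
    common = min(len(a), len(b))
    for i in range(common):
        if a[i] != b[i]:
            if start is None:
                start = i  # a new differing run begins here
        elif start is not None:
            ranges.append((start, i))  # the run ended just before i
            start = None
    if start is not None:
        ranges.append((start, common))
    if len(a) != len(b):
        # everything past the shorter file counts as a difference
        ranges.append((common, max(len(a), len(b))))
    return ranges


def compare_files(path_a: str, path_b: str) -> None:
    """Print digest equality and the differing byte ranges of two files."""
    a, b = Path(path_a).read_bytes(), Path(path_b).read_bytes()
    print("SHA-1 equal:  ", hashlib.sha1(a).digest() == hashlib.sha1(b).digest())
    print("SHA-256 equal:", hashlib.sha256(a).digest() == hashlib.sha256(b).digest())
    for start, end in differing_ranges(a, b):
        print(f"bytes {start}..{end - 1} differ ({end - start} bytes)")


# Hypothetical usage with two local files:
# compare_files("first.pdf", "second.pdf")
```

If the description above is right, a colliding pair should show equal SHA-1, unequal SHA-256, and only a small differing region near the start of the file.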

2 comments

niftich 3405 days ago

I didn't get a chance to make this point in that other thread, because the thread [1] of its follow-ups quickly morphed from promising [2] to meandering, but the combination of lax formats (PDF and JPEG in this instance) makes this style of collision particularly reductive and, in a sense, a cheap shot, if still devastatingly damaging given the ubiquity of PDF and JPEG -- both separately and together -- in document storage and archival.

This shows the importance of techniques like canonicalization and determinism, which ensure that a given result could only have been arrived at from exactly one input. For general-purpose programming languages like PostScript, of which PDF is a subset, this is essentially an unfulfillable requirement, as any number of input "source code" programs can produce observationally "same" results. Constrained formats, and formats where the set of 'essential information' can be canonicalized into a particular representation should be the norm, rather than the exotic exception, especially in situations where minute inessential differences can be cascaded to drastically alter the result.

[1] https://news.ycombinator.com/item?id=13715761 [2] https://news.ycombinator.com/item?id=13718772
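As a rough illustration of the canonicalization idea, here is a sketch using JSON rather than PDF, since JSON is simple enough to normalize in a few lines. Two documents with the same content but different key order and whitespace hash differently as raw bytes, and identically once re-serialized in one fixed form. The function name is made up for this example, and this is not a complete canonicalization scheme (it does not address number formatting or duplicate keys, for instance).

```python
import hashlib
import json


def canonical_digest(document: str) -> str:
    """SHA-256 of a JSON document after re-serializing it in one fixed form."""
    value = json.loads(document)
    # Sorted keys and fixed separators remove key order and whitespace freedom.
    canonical = json.dumps(value, sort_keys=True, separators=(",", ":"), ensure_ascii=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


a = '{"title": "report", "pages": 3}'
b = '{\n  "pages": 3,\n  "title": "report"\n}'

# Same information, different bytes: the raw digests disagree.
raw_equal = hashlib.sha256(a.encode()).hexdigest() == hashlib.sha256(b.encode()).hexdigest()
# After canonicalization both inputs map to the same byte string.
canonical_equal = canonical_digest(a) == canonical_digest(b)
print(raw_equal, canonical_equal)
```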


smallnamespace 3405 days ago

> Constrained formats, and formats where the set of 'essential information' can be canonicalized into a particular representation should be the norm, rather than the exotic exception, especially in situations where minute inessential differences can be cascaded to drastically alter the result.

That might be very challenging in practice, because a more expressive language directly allows a more compressed/efficient encoding of the same information, but at the cost of being more difficult (or impossible) to create a canonical representation.

Also, data formats that are purposefully redundant for error tolerance all basically have the property that readers should be tolerant of non-canonical forms. If we want to represent some bytes redundantly in case of data loss, then there must be multiple representations of those bytes that are all acceptable to the reader for this to work.

Video and image formats use multiple encodings to give encoders the room to make time-space trade-offs.


acqq 3405 days ago

I agree, for anything that a human is supposed to see with the eyes, there are always different representations that look the "same" enough to be indistinguishable.

People should be aware of it, not believe in a non-existing world where it isn't so.


Mindless2112 3405 days ago

> PDF and JPEG

Add ELF [1] and Zip [2] to the list. Many common file formats have areas where you can insert an arbitrary chunk of data without significant side effects.

[1] ELF allows for a very flexible layout, and is almost certainly vulnerable to this length-extension-based attack.

[2] Zip allows a comment at the end of the central directory. Since the central directory is at the end of the file, I don't know if it's vulnerable to this exact attack.
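To make the Zip point concrete, here is a small sketch using Python's standard `zipfile` module. It builds the same archive twice, once with an archive comment, and checks that the extracted contents are identical while the file bytes (and so any hash of them) differ. The helper names, file name, and comment text are invented for the example, and it only shows the slack space, not a collision.

```python
import hashlib
import io
import zipfile


def build_zip(comment: bytes) -> bytes:
    """Build a one-file archive in memory with the given archive comment."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # Fixed timestamp so the two archives differ only by the comment.
        info = zipfile.ZipInfo("hello.txt", date_time=(2017, 2, 23, 0, 0, 0))
        zf.writestr(info, b"hello world\n")
        zf.comment = comment  # stored at the end of the central directory
    return buf.getvalue()


def contents(data: bytes) -> dict[str, bytes]:
    """Map each member name to its extracted bytes."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return {name: zf.read(name) for name in zf.namelist()}


comment = b"arbitrary bytes that extraction never surfaces as file data"
plain = build_zip(b"")
padded = build_zip(comment)

print(contents(plain) == contents(padded))  # same files come out
print(hashlib.sha256(plain).hexdigest() == hashlib.sha256(padded).hexdigest())
print(padded.endswith(comment))  # the comment sits at the very end of the file
```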


therein 3405 days ago

How about PE format? You can basically write after the ImageBase+ImageLength and have no functional difference.


mfukar 3405 days ago

As the paper [http://shattered.io/static/shattered.pdf] says:

> This is an identical-prefix collision attack, where a given prefix P is extended with two distinct near-collision block pairs such that they collide for any suffix S

The near-collision block pairs are the difference everyone can see in the image. Whoever created the PDFs did everyone the courtesy of keeping the suffix the same. There are numerous examples already of different PDFs with the same hash.
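The "collide for any suffix S" part follows from how iterated (Merkle-Damgård style) hashes work: once two messages of equal length reach the same internal state, feeding both the same remaining bytes keeps them in lockstep. Below is a toy sketch of that property. Everything in it is an assumption made for illustration: the 2-byte state, the 8-byte block, and the compression function built from truncated SHA-256 are chosen so a collision can be brute-forced instantly, and a single colliding block stands in for the two near-collision block pairs of the real attack. It is not SHA-1 and says nothing about the cost of the real attack.

```python
import hashlib
from itertools import count

BLOCK = 8  # toy block size in bytes
STATE = 2  # toy chaining state in bytes, tiny on purpose so collisions are cheap


def compress(state: bytes, block: bytes) -> bytes:
    """Toy compression function: truncated SHA-256 of state || block."""
    return hashlib.sha256(state + block).digest()[:STATE]


def toy_hash(message: bytes) -> bytes:
    """Iterated hash: zero-pad to a block boundary, append a length block."""
    padded = message + b"\x00" * (-len(message) % BLOCK)
    padded += len(message).to_bytes(BLOCK, "big")
    state = b"\x00" * STATE
    for i in range(0, len(padded), BLOCK):
        state = compress(state, padded[i:i + BLOCK])
    return state


def find_colliding_blocks(prefix: bytes) -> tuple[bytes, bytes]:
    """Brute-force two distinct blocks giving the same state after prefix."""
    assert len(prefix) % BLOCK == 0
    state = b"\x00" * STATE
    for i in range(0, len(prefix), BLOCK):
        state = compress(state, prefix[i:i + BLOCK])
    seen = {}
    for n in count():
        block = n.to_bytes(BLOCK, "big")
        out = compress(state, block)
        if out in seen:
            return seen[out], block  # two different blocks, same next state
        seen[out] = block


prefix = b"%PDF-1.3"  # exactly one toy block
b1, b2 = find_colliding_blocks(prefix)

# Identical prefix, different middle block, identical suffix: still a collision.
for suffix in (b"", b"same tail", b"x" * 100):
    assert toy_hash(prefix + b1 + suffix) == toy_hash(prefix + b2 + suffix)
```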
