by guidovranken 1591 days ago

  The bottom line is that when you need to redact text, use black bars covering the whole text. Never use anything else.
That actually may not be enough if you're applying the black bar to compressed image data like JPEG, because compression artifacts surrounding the black bar can leak information about the covered data.

Might be interesting to test how plausible that is. How likely is it that a human doesn’t see the artifacts, yet they leak enough info to reconstruct the underlying data?
I think it's more likely that the JPEG blocks which straddle the box edges simply don't cover any meaningful part of the text. 12 pt font at 96 DPI is 16 pixels tall, meaning 50% of the vertical height of a line simply wouldn't fall into the blocks straddling the edges of a line-height box. You'd get ascenders and descenders but not much else. Tops of numbers or all-caps text is, I think, the best case.

Though, web images now are being served in higher resolutions (200+ DPI) for "retina" displays, and scanned images are generally 300 DPI, in which case you'd be lucky even to get ascenders and descenders.
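
A rough sketch of that arithmetic, for anyone who wants to plug in other sizes. It assumes 8-pixel blocks, a box drawn exactly around the line, and that each straddling block row reaches on average half a block into the box; `font_height_px` and `avg_edge_coverage` are made-up names for illustration.

```python
def font_height_px(points, dpi):
    """Nominal font height in pixels (one point is 1/72 inch)."""
    return points * dpi / 72


def avg_edge_coverage(points, dpi, block=8):
    """Rough fraction of a line's height that lands in the block rows
    straddling the top and bottom edges of the box.

    Assumption: each of the two straddling rows reaches, on average,
    half a block into the box, so together they cover about one block.
    """
    height = font_height_px(points, dpi)
    return min(1.0, block / height)


for dpi in (96, 200, 300):
    height = font_height_px(12, dpi)
    coverage = avg_edge_coverage(12, dpi)
    print(f"{dpi} DPI: {height:.1f} px tall, about {coverage:.0%} in edge blocks")
```

Under those assumptions the exposed share drops from about half the line height at 96 DPI to well under a fifth at 300 DPI.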

I'd be curious to give it a try though. If Facebook memes are any indication, many humans are totally oblivious to near-unreadable levels of artifacting.

You could resample the image to a slightly smaller size before saving; that would redo the compression bucketing.
Save the edited version to PNG first, and then back to JPEG?
Not necessarily good enough. In principle you need to either get the raw original, or black out every macroblock that ever contained any sensitive information.
Saving to PNG doesn't necessarily change anything (though see below) -- the issue is the artifacts that are already present.

JPEG breaks an image into 8×8 pixel blocks. Each of those blocks then has its information content reduced, so that it can be described in fewer bytes. (I.e., information is thrown away -- making JPEG "lossy", and producing visible artifacts.) This has the necessary side-effect that, when reconstituted, this 8×8 block now contains redundant information (if not, then the compression of that block was not lossy). This finally implies that at least some pixels of that block can be (at least partially) inferred from other pixels. That is, if lost, they can be recreated.
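
To make the "fewer bytes means redundant pixels" point concrete, here is a toy sketch in plain Python. It is not a real JPEG encoder: it runs an 8×8 DCT on one block and quantizes with a single uniform step (an assumption for simplicity; real encoders use a per-frequency quantization table). The sample block and the step of 40 are made up.

```python
import math

N = 8  # JPEG block size


def dct2(block):
    """Orthonormal 2-D DCT-II of an N x N block (lists of lists)."""
    def scale(k):
        return math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)

    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            total = 0.0
            for x in range(N):
                for y in range(N):
                    total += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            out[u][v] = scale(u) * scale(v) * total
    return out


def quantize(coeffs, step):
    """Uniform quantization: the lossy step (simplified, see lead-in)."""
    return [[round(c / step) for c in row] for row in coeffs]


# A block with a sharp vertical edge, e.g. part of a glyph stem.
block = [[0 if y < 3 else 255 for y in range(N)] for x in range(N)]
shifted = [[p - 128 for p in row] for row in block]  # level shift, as JPEG does

q = quantize(dct2(shifted), step=40)
nonzero = sum(1 for row in q for c in row if c != 0)
print(nonzero, "of 64 coefficients are non-zero after quantization")
```

Since all 64 decoded pixels are computed from that handful of surviving coefficients, the pixels are not independent of each other, which is the redundancy the argument relies on.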

(It's also helpful to understand that JPEG does not encode each block entirely on its own: the quantization and Huffman tables are shared across the whole image, acting as a kind of central "dictionary".)

For the above to be useful for inferring text hidden by a black box, it requires:

(a) that the edges of the black box are not aligned to the 8×8 grid;

(b) that the relevant portions of text to be recovered lie near the edges of the black box (i.e., within the 8×8 blocks which straddle the edges); and

(c) that these blocks originally contained data of sufficient complexity, and/or deviating sufficiently from the rest of the image content, that the encoder decided to throw away sufficient information in these blocks to leave significant artifacts.
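
Conditions (a) and (b) are easy to check mechanically. A minimal sketch, assuming the block grid starts at the image's top-left corner, blocks are 8×8 (no chroma subsampling), and the box is given as `(left, top, right, bottom)` with exclusive right/bottom; the function names are made up for illustration.

```python
BLOCK = 8  # JPEG block size in pixels


def is_grid_aligned(box, block=BLOCK):
    """True if every edge of the box sits on a block boundary."""
    return all(v % block == 0 for v in box)


def straddling_blocks(box, block=BLOCK):
    """Block indices (bx, by) that overlap the box but are not fully inside it.

    These are the only blocks where pixels outside the black box share a
    block with pixels that were covered.
    """
    left, top, right, bottom = box
    result = []
    for by in range(top // block, (bottom - 1) // block + 1):
        for bx in range(left // block, (right - 1) // block + 1):
            x0, y0 = bx * block, by * block
            x1, y1 = x0 + block, y0 + block
            fully_inside = (left <= x0 and x1 <= right
                            and top <= y0 and y1 <= bottom)
            if not fully_inside:
                result.append((bx, by))
    return result


print(is_grid_aligned((0, 0, 16, 16)), straddling_blocks((0, 0, 16, 16)))
print(is_grid_aligned((3, 5, 29, 21)), len(straddling_blocks((3, 5, 29, 21))))
```

An aligned box has no straddling blocks at all, so there is nothing to recover from in this way; an unaligned one exposes a ring of partially covered blocks around its edge.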

Finally, if the redacted image was re-encoded as a JPEG (or other lossy format), the re-encoding process must not have thrown away too much information in these blocks, else the redundant information will have been obscured and rendered all but useless for reconstituting the redacted information.

So, an easy way to avoid having redacted information extracted in this manner is simply to ensure that your black boxes extend at least 8 pixels beyond the redacted text in each direction. (And also, to force the JPEG encoder not to re-use the dictionary from the original image, as information about the statistical distribution of block data could theoretically be extracted from that. Round-tripping through PNG is one way to force this additional safety measure.)
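
A minimal sketch of the padding rule. The 8-pixel margin is the one described above; snapping the result outward to the 8×8 grid is an extra precaution of mine, not something the rule requires. `safe_redaction_box` is a hypothetical helper, and it assumes 8×8 blocks with the grid starting at the top-left corner.

```python
def safe_redaction_box(text_box, image_size, margin=8, block=8):
    """Grow a text bounding box so every block touching the text is covered.

    text_box is (left, top, right, bottom) with exclusive right/bottom;
    image_size is (width, height). The box is padded by `margin` pixels,
    snapped outward to the block grid, then clamped to the image.
    """
    left, top, right, bottom = text_box
    width, height = image_size
    # Round the padded top-left corner down to a block boundary.
    left = max(0, (left - margin) // block * block)
    top = max(0, (top - margin) // block * block)
    # Round the padded bottom-right corner up to a block boundary.
    right = min(width, -(-(right + margin) // block) * block)
    bottom = min(height, -(-(bottom + margin) // block) * block)
    return (left, top, right, bottom)


print(safe_redaction_box((20, 30, 100, 46), (640, 480)))
```

The returned box is what you would fill with black before saving.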

This still isn't 100% information-theoretically secure -- there's still residual information in artifacts elsewhere in the image about what patterns the original image's dictionary contained (which could be extracted with e.g. principal component analysis), which, when combined with a prior statistical distribution of the expected uncompressed content of the image, could leak some information about the portions which were redacted -- but I suspect the amount of information available via this channel to be vanishingly small.