


	by brotchie 176 days ago
	You'd think the go-to workflow for releasing redacted PDFs would be to draw black rectangles and then rasterize to image-only PDFs :shrug:

2 comments

selinkocalar 176 days ago

As someone who's built an entire business on "anti-screenshots" this is brilliant.

PDF redaction fails are everywhere and it's usually because people don't understand that covering text with a black box doesn't actually remove the underlying data.

I see this constantly in compliance. People think they're protecting sensitive info but the original text is still there in the PDF structure.
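To illustrate the point above, here is a minimal sketch in Python. It assumes a hand-written, uncompressed page content stream (the `CONTENT_STREAM` bytes and the `extract_literal_strings` helper are hypothetical, made up for this example). The stream first shows a string with the `Tj` operator and then paints a filled black rectangle over the same area. The rectangle only changes what is drawn; the string is still in the bytes.

```python
import re

# Hypothetical, hand-written content stream for illustration only.
# Real PDFs usually compress their streams and may use hex strings or TJ arrays.
CONTENT_STREAM = (
    b"BT /F1 12 Tf 72 700 Td (SSN: 123-45-6789) Tj ET\n"  # show the text
    b"0 0 0 rg 70 695 160 16 re f\n"                      # fill a black rectangle on top
)


def extract_literal_strings(stream: bytes) -> list[str]:
    """Return the literal strings passed to Tj operators in a content stream.

    Simplified on purpose: ignores escaped parentheses, hex strings and TJ arrays.
    """
    matches = re.findall(rb"\((.*?)\)\s*Tj", stream)
    return [m.decode("latin-1") for m in matches]


print(extract_literal_strings(CONTENT_STREAM))  # ['SSN: 123-45-6789']
```

This is not a general-purpose extractor, only a demonstration that drawing over text leaves the text operators untouched.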

embedding-shape 176 days ago

Not to mention that some PDF editors preserve previous edits in the PDF file itself, which people also seem unaware of. A somewhat more user-friendly description of the feature, without having to read the specification itself: https://developers.foxit.com/developer-hub/document/incremen...
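A rough way to check for this, sketched in Python under some assumptions: an incrementally updated PDF appends new objects and a new trailer after the original file, so the bytes usually contain more than one `%%EOF` marker, and cutting the file at an earlier marker often yields the previous revision. The helper names and the `fake_pdf` bytes are hypothetical. This is only a heuristic, since other layouts (linearized files, for example) can also contain more than one marker, and the marker could in principle appear inside stream data.

```python
EOF_MARKER = b"%%EOF"


def count_eof_markers(pdf_bytes: bytes) -> int:
    """Count %%EOF markers; more than one hints at appended revisions."""
    return pdf_bytes.count(EOF_MARKER)


def earlier_revisions(pdf_bytes: bytes) -> list[bytes]:
    """Return the byte prefixes that end at each %%EOF marker except the last one."""
    ends = []
    start = 0
    while True:
        idx = pdf_bytes.find(EOF_MARKER, start)
        if idx == -1:
            break
        end = idx + len(EOF_MARKER)
        ends.append(end)
        start = end
    return [pdf_bytes[:end] for end in ends[:-1]]


# Hypothetical stand-in for a file that was saved once and then edited.
fake_pdf = b"%PDF-1.7\n(original body)\n%%EOF\n(appended update)\n%%EOF\n"

print(count_eof_markers(fake_pdf))       # 2
print(len(earlier_revisions(fake_pdf)))  # 1
```

If a released file has earlier revisions like this, the pre-redaction content may still be recoverable from them.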

shbooms 176 days ago

Oftentimes you will have requirements that the documents you release be digitally searchable, so in those cases this would not be an option.

pottertheotter 176 days ago

This made me think of something I came across recently that’s almost the opposite problem of requiring PDFs to be searchable. A local government would publish PDFs where the text is clearly readable on screen, but the selectable text layer is intentionally scrambled, so copy/paste or search returns garbage. It's a very hostile thing to do, especially with public data!

2ICofafireteam 173 days ago

I have encountered PDFs that would exhibit this behavior in one browser but not in another.

One fun thing I encountered from local government is releasing files with potato quality resolution and not considering the page size.

I had a FOI request that returned mainly Arch D sized drawings but they were in a 94 DPI PDF rendered as letter sized. It was a fun conversation trying to explain to an annoyed city employee that putting those large drawings in a 94 DPI letter size page effectively made it 30-ish DPI.
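The arithmetic behind the "30-ish DPI" figure can be sketched in Python. This assumes Arch D is 24 x 36 inches, letter is 8.5 x 11 inches, and the drawing is scaled down uniformly to fit the page; the function name `effective_dpi` is made up for this example.

```python
def effective_dpi(render_dpi: float,
                  original_in: tuple[float, float],
                  page_in: tuple[float, float]) -> float:
    """Pixels per inch of the ORIGINAL drawing after it is shrunk to fit the page.

    The scale factor is limited by whichever dimension shrinks the most.
    """
    scale = min(page_in[0] / original_in[0], page_in[1] / original_in[1])
    return render_dpi * scale


ARCH_D = (24.0, 36.0)  # inches, assumed portrait orientation
LETTER = (8.5, 11.0)   # inches

print(round(effective_dpi(94, ARCH_D, LETTER), 1))  # 28.7
```

The limiting dimension is the long side: 36 inches squeezed into 11 inches is a scale of about 0.31, so 94 DPI on the page works out to roughly 29 DPI relative to the original sheet.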

eviks 175 days ago

Hostile indeed, and also happens in user-facing documents like product manuals!

8note 176 days ago

Run some OCR on them afterwards to recreate the text layer?

albert_e 176 days ago

With the aggressive push of LLMs and generative AI, I am expecting a lot of OCR features to become "smarter" by default, namely to go beyond mechanical OCR and start inserting hallucinations and semantically/contextually "more correct" information into the OCR output.

It's not hard to imagine some powerful LLMs being able to undo light redactions that are deducible from context.

blharr 175 days ago

Or worse, making up names or information instead of writing the redaction.
