by Tangurena2 924 days ago
Mostly, the page description language (inside PDFs) says "move to point (x, y) and write text foo" or "move to point (x, y) and draw glyph bar".
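To make that concrete, here is a minimal sketch of reading such instructions. It assumes a heavily simplified content stream that uses only the BT, Td and Tj operators; the function name show_text_ops and the sample stream are made up for illustration, and a real PDF also needs decompression, font handling, escape sequences and many more operators.

```python
import re

# Toy tokenizer: literal strings "(...)", names "/F1", numbers, operators.
# Assumption: no nested parentheses and no escape handling inside strings.
TOKEN = re.compile(r"\((?:[^()\\]|\\.)*\)|/[^\s()/]+|[-+]?\d*\.?\d+|[A-Za-z'\"*]+")

def show_text_ops(stream):
    """Return [(x, y, text)] for every Tj in a simplified content stream."""
    out = []
    operands = []
    x = y = 0.0
    for tok in TOKEN.findall(stream):
        # Operands pile up until an operator consumes them.
        if tok[0] in "(/+-.0123456789":
            operands.append(tok)
            continue
        if tok == "BT":        # begin text object: reset the position
            x = y = 0.0
        elif tok == "Td":      # move, relative to the start of the current line
            x += float(operands[-2])
            y += float(operands[-1])
        elif tok == "Tj":      # show a string at the current position
            out.append((x, y, operands[-1][1:-1]))
        operands.clear()       # other operators (Tf, ET, ...) are ignored
    return out

sample = "BT /F1 12 Tf 72 720 Td (Hello, ) Tj 0 -14 Td (world) Tj ET"
for x, y, text in show_text_ops(sample):
    print(x, y, repr(text))
```

Note that the sketch does not advance x after a Tj, so it only reports where each fragment starts, not how wide it is.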

The reason many PDFs produce garbage when you extract text is that the underlying document doesn't include fonts; every letter is a drawn glyph. This is most common in older (1990s) PDFs generated on UNIX systems.
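A rough illustration of why the output looks like garbage, under an assumption of mine about the mechanism: the file stores private glyph numbers rather than characters, and nothing maps those numbers back to letters. The names glyph_order, codes and extract are hypothetical, not taken from any PDF library.

```python
# Hypothetical document: the generator numbered glyphs 1..n in order of first use.
glyph_order = ["H", "e", "l", "o"]   # what the drawn shapes look like to a human
codes = bytes([1, 2, 3, 3, 4])       # what the page actually stores for "Hello"

def extract(codes, to_unicode=None):
    """Turn stored glyph codes into text, with or without a code-to-letter map."""
    if to_unicode is None:
        # No mapping available: the best an extractor can do is guess an encoding.
        return codes.decode("latin-1")
    return "".join(to_unicode[c] for c in codes)

mapping = {i + 1: ch for i, ch in enumerate(glyph_order)}
print(repr(extract(codes)))           # control characters, i.e. garbage
print(repr(extract(codes, mapping)))  # readable text
```

The page renders identically either way, since the viewer only needs the shapes; the mapping matters only for copy/paste and search.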

Since the page description language is saying "write text foo", that text is broken up by the generating software, so there is not necessarily a whole line of text as a human would see it.
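An extractor therefore has to stitch the fragments back together itself. A minimal sketch, assuming each fragment arrives as an (x, y, text) tuple and that fragments whose y values land in the same bucket belong to one line; rebuild_lines and the y_tolerance value are illustrative choices, and real tools also have to deal with columns, rotation and missing spaces.

```python
from collections import defaultdict

def rebuild_lines(fragments, y_tolerance=2.0):
    """Group (x, y, text) fragments into lines, top to bottom, left to right."""
    rows = defaultdict(list)
    for x, y, text in fragments:
        # Bucket by baseline so small vertical jitter still lands on one line.
        rows[round(y / y_tolerance)].append((x, text))
    lines = []
    for key in sorted(rows, reverse=True):   # PDF y grows upward, so larger y is higher on the page
        lines.append("".join(text for _, text in sorted(rows[key])))
    return lines

fragments = [(150, 700.3, "world"), (72, 700, "Hello, "), (72, 686, "second line")]
print(rebuild_lines(fragments))
```

Fixed buckets are the simplest possible grouping; two baselines that straddle a bucket boundary would be split, which is one reason real extractors use more careful clustering.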

And some PDFs are impossible to extract text from because they've flattened the page into an image. Law firms are notorious for doing this: it provides the documents exactly as required/specified during discovery, but makes it impossible to extract the text. Basically it is a fax - every page is a TIFF image (because it is harder to OCR than a JPEG, although JBIG2 has its own flaws [0]).

I've been working with PDF projects off and on since the late 90s. The standard tries to be everything for everybody and that makes it a Charlie Foxtrot from top to bottom (if you've ever written an object viewer/editor to dig inside PDFs, you know exactly what I'm referring to). It is a great spec for making a document that appears the same no matter where it is viewed or printed. But I always treat it as a sausage: you can turn the cow into a sausage, but you can't turn that sausage back into a cow.

0 - https://www.theverge.com/2013/8/6/4594482/xerox-copiers-rand...

2 comments

I wonder if this explains why trying to copy and paste text out of a PG&E bill would always come back as gobbledygook when I used to receive such bills in the past.
DjVu works like that too; -BUT- you can embed the text with some internal operation in both kinds of documents.

On GNU/Linux/BSD you have OCRmyPDF to do that.
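For anyone who wants to script it, a small sketch of driving the ocrmypdf command line from Python. The helper name ocr_command and the file names are made up; the -l and --skip-text options are the ones I know from the OCRmyPDF CLI, and this assumes ocrmypdf and its Tesseract language data are installed.

```python
import subprocess

def ocr_command(src, dst, language="eng"):
    """Build an ocrmypdf invocation that adds a text layer to image-only pages."""
    # --skip-text leaves pages that already contain text untouched.
    return ["ocrmypdf", "-l", language, "--skip-text", src, dst]

def add_text_layer(src, dst, language="eng"):
    """Run ocrmypdf; raises CalledProcessError if the tool reports failure."""
    subprocess.run(ocr_command(src, dst, language), check=True)

# Example (not executed here): add_text_layer("scan.pdf", "scan_ocr.pdf")
```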