| HN Mirror


by JKCalhoun 926 days ago

Most often, the text runs you extract from a PDF are already in the order you would read them. That is probably because the word processor or application that created the PDF dumped the text into the capturing PDF context straight from its own text container, which holds the text in the order the application displays it (editing, searching, and text selection in the creating app obviously benefit if the text container is in reading order).

When text is not in reading order within a PDF page, the out-of-order pieces are often headers, footers, captions, callouts, block quotes, and the like.

There are, I believe, accessibility features in the modern PDF spec that would give you more structure on top of that raw text. I am not sure they are widely used when creating PDFs, though.
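For the cases where runs are not stored in reading order, here is a rough sketch of the usual fallback: ignore the stored order and re-sort the runs by position on the page. The `TextRun` type, the `reading_order` function, and the `line_tolerance` parameter are all made up for illustration and do not come from any particular PDF library. It also assumes a single-column page with the PDF convention of the origin at the bottom-left.

```python
from dataclasses import dataclass


@dataclass
class TextRun:
    """Hypothetical container for one extracted text run."""
    x: float   # left edge, in points
    y: float   # baseline, in points (y grows upward, as in PDF user space)
    text: str


def reading_order(runs, line_tolerance=2.0):
    """Group runs into lines by baseline, then read top to bottom, left to right.

    Assumes a single-column layout; multi-column pages would need
    column detection first.
    """
    lines = []
    # Highest baseline first, because PDF y coordinates grow upward.
    for run in sorted(runs, key=lambda r: -r.y):
        if lines and abs(lines[-1][0].y - run.y) <= line_tolerance:
            lines[-1].append(run)   # same visual line as the previous run
        else:
            lines.append([run])     # start a new line
    # Within each line, order runs left to right and join them.
    return [" ".join(r.text for r in sorted(line, key=lambda r: r.x))
            for line in lines]


if __name__ == "__main__":
    # The footer was written second, the title last.
    runs = [
        TextRun(72, 700, "Hello"),
        TextRun(72, 40, "Page 1"),
        TextRun(110, 700, "world"),
        TextRun(72, 760, "Title"),
    ]
    for line in reading_order(runs):
        print(line)
```

This only handles the easy case. Captions, callouts, and anything in columns can still end up interleaved with the body text, which is where real structure information would help.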

1 comments

ralferoo 926 days ago

It's true that data is often written out in a logical order, but like you say, that's only because the program that created it was designed that way. I've definitely seen PDF files where tabular data is almost in a logical order but every now and then cells have been jumbled around.

But what I've definitely seen is documents where the characters are deliberately jumbled up and paired with a custom font, so that visually everything looks fine. I know this because there was one specific case where I wanted to extract about 5000 words from a vocabulary list and the extracted text was hard to decipher. They'd used several such fonts in the single document as well, so no single one-to-one character mapping covered the whole document. They'd also put a watermark under the list, so you couldn't easily OCR the rendered page image either.
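To make the several-fonts problem concrete, here is a minimal sketch of undoing that kind of scrambling once you have worked out a substitution table for each font. Everything in it is hypothetical: the font names, the tables in `FONT_MAPS`, and the `decode_runs` function are invented for illustration, and in practice the tables would have to be built by hand, for example by comparing extracted characters against the rendered glyphs.

```python
# Hypothetical per-font substitution tables:
# extracted (scrambled) character -> character the glyph actually shows.
FONT_MAPS = {
    "F1": {"q": "c", "x": "a", "m": "t"},
    "F2": {"a": "d", "z": "o", "k": "g"},
}


def decode_runs(runs, font_maps):
    """Decode (font_name, raw_text) pairs using the table for each run's font.

    Characters without an entry, and runs in unknown fonts, pass through
    unchanged.
    """
    decoded = []
    for font_name, raw_text in runs:
        table = font_maps.get(font_name, {})
        decoded.append("".join(table.get(ch, ch) for ch in raw_text))
    return decoded


if __name__ == "__main__":
    runs = [("F1", "qxm"), ("F2", "azk")]
    print(decode_runs(runs, FONT_MAPS))
```

The point of keying the tables by font is that the same extracted character can stand for different visible letters depending on which font the run uses, so a single global mapping cannot work.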

JKCalhoun 926 days ago

To be sure, the content creator can run riot with the PDF spec and make it suck for everyone but a human reading the screen or printed page. Fortunately, I would say 99% of PDFs are much better behaved than that.