by crispyambulance 924 days ago
> I have no idea why somebody hates pdfs for extracting data, when stuff like doc,xls (the old format) is clearly way worse.

Is this sarcasm?

AFAIK, PDF is deliberately designed to not give a F about semantics. There is no way to determine what is part of what in a PDF document. All you've got is association by adjacency.
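
To make "association by adjacency" concrete, here is a minimal sketch of rebuilding lines from positioned text runs. It assumes some extractor has already handed you `(x, y, text)` tuples; that tuple shape, the function name `group_runs_into_lines`, and the `y_tolerance` value are all made up for illustration and not taken from any particular library.

```python
def group_runs_into_lines(runs, y_tolerance=2.0):
    """Group (x, y, text) runs into lines using position only.

    Hypothetical helper: the input shape and the tolerance are
    illustrative assumptions, not part of any PDF library.
    """
    lines = []  # each entry is [reference_y, [(x, text), ...]]
    # PDF user space normally has y growing upward, so a larger y
    # means higher on the page: visit runs from top to bottom.
    for x, y, text in sorted(runs, key=lambda r: -r[1]):
        if lines and abs(lines[-1][0] - y) <= y_tolerance:
            # Close enough vertically: treat as the same line.
            lines[-1][1].append((x, text))
        else:
            lines.append([y, [(x, text)]])
    # Within a line, order runs left to right by x.
    return [" ".join(t for _, t in sorted(items)) for _, items in lines]


runs = [
    (120, 680, "line"),
    (72, 700, "Hello"),
    (72, 680, "Second"),
    (110, 700.5, "world"),
]
print(group_runs_into_lines(runs))
```

Nothing here knows that two runs belong to the same sentence, table cell, or caption; it only guesses from coordinates, which is exactly the limitation being described.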

Hasn't it always been that way? Has something changed?


Most often the text runs you extract from a PDF are in the order you would read them. That is probably because the word processor or application that created the PDF dumped the text into the PDF context straight from its own text container, which already holds the text in display order (editing, searching, and text selection in the creating app all benefit if the container is in reading order).

When text is not in reading order within a PDF page, it is often headers, footers, captions, callouts, block quotes....

There are, I believe, features in the modern PDF spec for accessibility that would give you more structure on top of that raw text. I am not sure this feature is widely used when creating PDFs, though.
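
For what it's worth, the accessibility feature in question is usually called Tagged PDF, and a tagged file carries a `/StructTreeRoot` entry in its document catalog. A rough sketch of a first-pass check, assuming you just have the raw file bytes (the function name `looks_tagged` is made up here):

```python
def looks_tagged(pdf_bytes: bytes) -> bool:
    """Crude heuristic: is the /StructTreeRoot key visible in the raw bytes?

    Assumption: the catalog is stored uncompressed. If it lives inside
    a compressed object stream, this scan misses it, so a False result
    is not conclusive.
    """
    return b"/StructTreeRoot" in pdf_bytes


# Synthetic fragments, not real PDF files.
print(looks_tagged(b"<< /Type /Catalog /StructTreeRoot 5 0 R >>"))
print(looks_tagged(b"<< /Type /Catalog /Pages 2 0 R >>"))
```

A real check would parse the catalog with a proper PDF library; this only shows what to look for.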

It's true that data is often written out in a logical order, but like you say, that's only because the program that created it was designed that way. I've definitely seen PDF files where tabular data is almost in a logical order but every now and then cells have been jumbled around.

But what I've definitely seen is documents where the character codes are deliberately jumbled and a custom font maps them back, so that visually everything looks fine. I know this because there was one specific case where I wanted to extract about 5000 words from a vocabulary list and the extracted text was hard to decipher. They'd used several such fonts in the single document as well, so there wasn't a single one-to-one mapping for the text encryption. They'd also put a watermark under the list, so you couldn't easily OCR the rendered page either.
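
A minimal sketch of what undoing that looks like once you have worked out each font's mapping by hand. The font names `F1` and `F2`, the mappings, and the function `decode_runs` are all invented for illustration; the real document's tables would have to be recovered by comparing rendered glyphs with the extracted codes.

```python
def decode_runs(runs, font_maps):
    """Undo a per-font character substitution.

    runs:      list of (font_name, scrambled_text) pairs
    font_maps: {font_name: {scrambled_char: real_char}}
    Both structures are hypothetical, chosen for this example.
    """
    decoded = []
    for font, scrambled in runs:
        table = font_maps[font]
        # Characters missing from the table pass through unchanged.
        decoded.append("".join(table.get(ch, ch) for ch in scrambled))
    return decoded


font_maps = {
    "F1": {"q": "c", "x": "a", "z": "t"},
    "F2": {"q": "d", "x": "o", "z": "g"},
}
# The same scrambled string means different things in different fonts.
runs = [("F1", "qxz"), ("F2", "qxz")]
print(decode_runs(runs, font_maps))
```

The point of the example is that the font has to be tracked per text run: with several fonts in play, a single global substitution table cannot work.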

To be sure, the content creator can run riot with the PDF spec and make it suck for everyone but a human reading the screen or printed page. Fortunately, I would say 99% of PDFs are much better behaved than that.