by merb 924 days ago
Actually most pdfs are formatted in a good way and it’s easy to extract text. The stupid stuff is just copy encryption, which is just a stupid feature (because pdf viewers can ignore it). I have no idea why somebody hates pdfs for extracting data, when stuff like doc,xls (the old format) is clearly way worse. Pdf sometimes has its quirks, but the 2.0 version clearly cleans up a lot of the messes.
2 comments

It’s usually easy to extract individual strings from a PDF, normally single lines, but it can be quite hard to understand how those form into longer paragraphs, especially if the page has multiple columns and inline figures.
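To make the column problem concrete, here is a minimal sketch in Python. It assumes the text runs have already been pulled out of the page with their coordinates, and that the page has exactly two columns split at a known x position. The `Run` type, the `reading_order` function, and the `column_split` and `line_tol` parameters are all made up for illustration and don't correspond to any particular PDF library.

```python
from dataclasses import dataclass


@dataclass
class Run:
    x: float   # left edge of the run
    y: float   # baseline, measured from the top of the page
    text: str


def reading_order(runs, column_split, line_tol=2.0):
    """Return lines ordered column by column, top to bottom, left to right.

    Assumes a two-column page split at x == column_split (hypothetical).
    Runs whose baselines differ by at most line_tol count as one line.
    """
    columns = ([], [])
    for run in runs:
        columns[0 if run.x < column_split else 1].append(run)

    lines = []
    for column in columns:
        current, current_y = [], None
        for run in sorted(column, key=lambda r: (r.y, r.x)):
            # A jump in baseline larger than the tolerance starts a new line.
            if current and abs(run.y - current_y) > line_tol:
                lines.append(" ".join(r.text for r in sorted(current, key=lambda r: r.x)))
                current = []
            if not current:
                current_y = run.y
            current.append(run)
        if current:
            lines.append(" ".join(r.text for r in sorted(current, key=lambda r: r.x)))
    return lines


runs = [
    Run(300, 100.0, "right1"),
    Run(50, 100.0, "Hello"),
    Run(90, 100.5, "world"),
    Run(50, 115.0, "second"),
    Run(300, 115.0, "right2"),
]
print(reading_order(runs, column_split=200))
```

Even this toy version needs the column boundary handed to it. Real pages with inline figures, full-width headings, or uneven columns are where the guessing starts.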

It’s also easy to create a PDF that is hard to extract text from, not through a deliberate attempt to enforce copy protection but often simply from attempts to reduce the size of the file, as you may not want to store the entirety of a font in the document.

I’ve been on both ends of this, generating documents and consuming them, and I think we probably could have created something that allowed for much easier text extraction, but it’s far too late now.

> I have no idea why somebody hates pdfs for extracting data, when stuff like doc,xls (the old format) is clearly way worse.

Is this sarcasm?

AFAIK, pdf is deliberately designed to not give a F about semantics. There is no way to determine what is part of what in a pdf document. All you got is association by adjacency.

Hasn't it always been that way? Has something changed?

Most often the text runs you would extract from a PDF are in the order you would read the text. Probably because the word processor or application that created the PDF dumped the text into the capturing PDF context from its own text container that, in fact, contains the text in the order the word processor would display it (editing, searching, text selection in the creating app obviously benefit if the text container is in reading order).

When text is not in reading order within a PDF page it is often headers, footers, captions, callouts, block quotes....

There are, I believe, features in the modern PDF spec to allow for accessibility that would give you more structure to that raw text. I am not sure that this is a widely used feature when creating PDFs, though.

It's true that data is often written out in a logical order, but like you say, that's only because the program that created it was designed that way. I've definitely seen PDF files where tabular data is almost in a logical order but every now and then cells have been jumbled around.

But what I've definitely seen is documents where the characters are deliberately jumbled up and a custom font is used so that visually everything looks fine. I know this because there was one specific case where I wanted to extract about 5000 words in a vocabulary list and it was hard to decipher. They'd used several such fonts in the single document as well, so there wasn't a one-to-one mapping of the text encryption. They'd also put a watermark under the list, so you couldn't easily do OCR of the final screen image either.
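For anyone who hasn't run into this, a rough sketch of why several fonts make it worse. It assumes you have somehow recovered a substitution table for each font (by hand or by comparing rendered glyphs), which is the hard part and isn't shown. The font names, the tables, and the `decode_runs` function are invented for illustration and are not how that particular document was handled.

```python
def decode_runs(runs, font_maps):
    """Undo per-font character scrambling.

    runs      -- list of (font_name, extracted_text) pairs in page order
    font_maps -- hypothetical tables: font_name -> {stored_char: real_char}
    Characters missing from a font's table are passed through unchanged.
    """
    decoded = []
    for font_name, raw in runs:
        mapping = font_maps.get(font_name, {})
        decoded.append("".join(mapping.get(ch, ch) for ch in raw))
    return "".join(decoded)


# The same stored characters mean different things in different fonts,
# so a single global substitution table cannot work.
font_maps = {
    "F1": {"q": "c", "x": "a", "z": "t"},
    "F2": {"q": "d", "x": "o", "z": "g"},
}
runs = [("F1", "qxz"), ("F2", " "), ("F2", "qxz")]
print(decode_runs(runs, font_maps))  # cat dog
```

The extracted string is "qxz" both times, and only the font attached to each run tells you which word it is.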

To be sure, the content-creator can run riot with the PDF spec and make it suck for everyone but a human reading the screen or printed page. Fortunately I would say 99% of PDFs are much better behaved than that.