by ralferoo 922 days ago
So, reading the article is a bit weird. It's clear there's an anti-PDF bias from the start, with the implicit assumption that everybody hates reading PDF files. Actually, I don't because I get to read a well formatted document. They even say that it should only be used as a format for things to be printed, never as a document for people to read on a computer... and yet this is clearly meant to be read once on a screen and not printed out. It also contains a hypertext link to their company that obviously wouldn't work if printed, and they embed it in an iframe, because they expect people to be reading it online.

But towards the end, you start to see the real objection to PDFs - that it's not always easy to extract text automatically from a document. It mentions a few of the issues - extra spaces, not enough spaces, hidden text that gets extracted because it's off-page, fonts that are designed to obfuscate the internal text, e.g. re-arranging characters or splitting glyphs up in strange ways, etc. It's worth noting that with the exception of the spaces, these techniques are used deliberately to stop people extracting text or the copyrighted fonts from the document.

It's not at all obvious from the document itself, but if you click on the link to the company, all becomes clear. This company says all these things are problems with PDF because it is in the business of extracting raw text from PDFs. That means ignoring all the designers' efforts to place things in specific places to make the page pleasing for a human to read, etc... They don't want any of that. They just want to extract the raw text so they can data mine it and sell that as a service.

5 comments

You are not reading a PDF document, you are reading a visual representation constructed by a program which is made by people who tear their hair out.

The PDF “specification” is not a specification; it only documents the happy path. It never states that the behavior of Acrobat remains the holy truth, but in practice undocumented bug-for-bug compatibility is assumed. (We're talking about the most basic, universally supported features here.) If ISO were worth its salt, it would at least try to codify the de facto behavior instead of stamping its name on some Adobe-provided document; then PDF would be a horrible but fixed format. A collection of tests would be nice to have, too.

Of course, this “history” is just a promotional leaflet, which describes the “layman approach” they tried to construct. It's a fault not to mention that PDF was, and still is, a foundation of the digital print industry, where big vendors solve compatibility problems for mere mortals, and therefore create unwritten rules of what should and shouldn't work.

It is also ironic that they praise the Web, but have to use Web Archive to link to the article from the ancient year of… 2020.

Mostly, the page description language (inside PDFs) says "move to point (x, y) and write text foo" or "move to point (x, y) and draw glyph bar".

Many PDFs produce garbage when you extract text because the underlying document doesn't include fonts: every letter is a drawn glyph. This is most common in older (1990s) PDFs generated on UNIX systems.

Since the page description language is saying "write text foo", that text is broken up by the generating software, so there is not necessarily a whole line of text as a human would see it.
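To make the "move to (x, y) and write text foo" point concrete, here is a minimal sketch of what an extractor has to do with those fragments. Everything in it is an assumption for illustration: the `TextRun` record, the `rebuild_lines` helper, and the tolerance values are hypothetical and not taken from any real PDF library. It only shows the general idea of grouping fragments by baseline and guessing spaces from horizontal gaps.

```python
from dataclasses import dataclass


@dataclass
class TextRun:
    """One hypothetical "write text" operation as an extractor might see it."""
    x: float      # left edge of the run, in page units
    y: float      # baseline position (PDF y grows upward)
    text: str     # characters shown by this single operation
    width: float  # horizontal advance of the run


def rebuild_lines(runs, y_tolerance=2.0, space_gap=1.5):
    """Group runs that share a baseline, then join each group left to right.

    A space is guessed whenever the gap between two runs exceeds space_gap.
    Both thresholds are illustrative guesses, not values from any spec.
    """
    lines = []  # list of (baseline_y, [runs on that baseline])
    for run in sorted(runs, key=lambda r: -r.y):  # top of the page first
        if lines and abs(lines[-1][0] - run.y) <= y_tolerance:
            lines[-1][1].append(run)
        else:
            lines.append((run.y, [run]))

    result = []
    for _, line_runs in lines:
        line_runs.sort(key=lambda r: r.x)
        pieces = [line_runs[0].text]
        for prev, cur in zip(line_runs, line_runs[1:]):
            gap = cur.x - (prev.x + prev.width)
            if gap > space_gap:
                pieces.append(" ")  # inferred, never stored in the file
            pieces.append(cur.text)
        result.append("".join(pieces))
    return result


# The generating software emitted one visual line as three out-of-order runs.
runs = [
    TextRun(x=100, y=700, text="Hel", width=15),
    TextRun(x=150, y=700, text="world", width=25),
    TextRun(x=115, y=700, text="lo", width=10),
    TextRun(x=100, y=685, text="second line", width=55),
]
print(rebuild_lines(runs))
```

Pick `space_gap` too large and words run together; too small and you get spaces inside words, which is the "extra spaces, not enough spaces" problem mentioned at the top of the thread.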

And some PDFs are impossible to extract text from because they've flattened the page into an image. Law firms are notorious for doing this: they provide the documents exactly as required/specified during discovery, but make it impossible to extract the text. Basically it is a fax - every page is a TIFF image (because it is harder to OCR than a JPEG, although JBIG2 has its own flaws [0]).

I've been working with PDF projects off and on since the late 90s. The standard tries to be everything for everybody and that makes it a Charlie Foxtrot from top to bottom (if you've ever written an object viewer/editor to dig inside PDFs, you know exactly what I'm referring to). It is a great spec for making a document that appears the same no matter where it is viewed or printed. But I always treat it as a sausage: you can turn the cow into a sausage, but you can't turn that sausage back into a cow.

0 - https://www.theverge.com/2013/8/6/4594482/xerox-copiers-rand...

I wonder if this explains why trying to copy and paste text out of a PG&E bill would always come back as gobbledygook when I used to receive such bills in the past.

DJVU works like that too; -BUT- you can embed the text with some internal operation in both kinds of documents.

On GNU/Linux and BSD you have OCRmyPDF to do that.

> fonts that are designed to obfuscate the internal text, e.g. re-arranging characters or splitting glyphs up in strange ways, etc. It's worth noting that with the exception of the spaces, these techniques are used deliberately to stop people extracting text or the copyrighted fonts from the document.

Maybe (and, for the fonts, likely), but I don’t think it’s the only reason. Subsetting embedded fonts makes PDFs smaller, often a lot smaller. (Why embed an entire font because the document uses a single glyph of it as a bullet point? Why would one include Chinese, Japanese, etc. glyphs if the document doesn’t use them?)

Even if it’s possible to do that without changing the code point to glyph mapping (is it? I don’t know enough about fonts to answer that), implementing it may be simpler or result in smaller files if one makes the embedded font dense in code points. (I tried finding an answer, but soon remembered how complex fonts are, and gave up.)
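A toy sketch of the "dense in code points" idea, to show why it can turn copy and paste into gobbledygook without anyone intending obfuscation. This is an assumption-laden illustration, not how any particular PDF producer works: `build_subset`, `encode`, and `extract` are hypothetical helpers, and starting the codes at 0x41 is chosen only so the naive output is printable. PDF does have an optional ToUnicode table for mapping codes back to characters; the `to_unicode` dict below stands in for it.

```python
def build_subset(text, first_code=0x41):
    """Assign dense codes in order of first use, keeping only used characters."""
    char_to_code = {}
    for ch in text:
        if ch not in char_to_code:
            char_to_code[ch] = first_code + len(char_to_code)
    return char_to_code


def encode(text, char_to_code):
    """What ends up in the page content: subset codes, not Unicode."""
    return [char_to_code[ch] for ch in text]


def extract(codes, to_unicode=None):
    """Recover text from codes, with or without a reverse mapping table."""
    if to_unicode is None:
        # No table available: the extractor can only guess that each code
        # is an ordinary character code.
        return "".join(chr(c) for c in codes)
    return "".join(to_unicode[c] for c in codes)


text = "hello pdf"
subset = build_subset(text)
codes = encode(text, subset)
to_unicode = {code: ch for ch, code in subset.items()}

print(extract(codes))              # garbled: the codes are only meaningful to the subset font
print(extract(codes, to_unicode))  # original text, thanks to the reverse table
```

The page still renders correctly either way, because the embedded subset font draws the right glyph for each code. Only extraction depends on whether the reverse table was written out.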

And of course, modern tools _should_ output accessible PDF documents, which means text extraction _should_ work. I wouldn’t know how well that works in reality, but have my doubts.

Actually most PDFs are formatted in a good way and it’s easy to extract text. The stupid stuff is just copy encryption, which is just a stupid feature (because PDF viewers can ignore it). I have no idea why somebody hates pdfs for extracting data, when stuff like doc,xls (the old format) is clearly way worse. PDF sometimes has its quirks, but the 2.0 version clearly cleans up a lot of the messes.

It’s usually easy to extract individual strings from a PDF, normally single lines, but it can be quite hard to understand how those form into longer paragraphs, especially if the page has multiple columns and inline figures.
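A small sketch of the multi-column problem, under loud assumptions: the block dictionaries, the `naive_order` and `column_order` helpers, and the known `gutter_x` position are all hypothetical. A real extractor has to infer the column layout itself, and inline figures make that harder still.

```python
def naive_order(blocks):
    """Read strictly top to bottom, then left to right."""
    ordered = sorted(blocks, key=lambda b: (-b["y"], b["x"]))
    return [b["text"] for b in ordered]


def column_order(blocks, gutter_x):
    """Read the whole left column first, then the right one.

    gutter_x is assumed to be known here; finding it is the hard part.
    """
    left = [b for b in blocks if b["x"] < gutter_x]
    right = [b for b in blocks if b["x"] >= gutter_x]

    def top_down(column):
        return sorted(column, key=lambda b: -b["y"])

    return [b["text"] for b in top_down(left) + top_down(right)]


# Two columns, two lines each (PDF y grows upward, so 700 is above 685).
blocks = [
    {"x": 50, "y": 700, "text": "left 1"},
    {"x": 320, "y": 700, "text": "right 1"},
    {"x": 50, "y": 685, "text": "left 2"},
    {"x": 320, "y": 685, "text": "right 2"},
]

print(naive_order(blocks))         # interleaves the two columns
print(column_order(blocks, 300))   # keeps each column together
```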

It’s also easy to create a PDF that is hard to extract text from, not through a deliberate attempt to enforce copy protection but often simply from attempts to reduce the size of the file, as you may not want to store the entirety of a font in a document.

I’ve been on both ends of this, generating documents and consuming them, and I think we probably could have created something that allowed for much easier text extraction, but it’s far too late now.

> I have no idea why somebody hates pdfs for extracting data, when stuff like doc,xls (the old format) is clearly way worse.

Is this sarcasm?

AFAIK, PDF is deliberately designed to not give a F about semantics. There is no way to determine what is part of what in a PDF document. All you've got is association by adjacency.

Hasn't it always been that way? Has something changed?

Most often the text runs you would extract from a PDF are in the order you would read the text. Probably because the word processor or application that created the PDF dumped the text into the capturing PDF context from its own text container that, in fact, contains the text in the order the word processor would display it (editing, searching, text selection in the creating app obviously benefit if the text container is in reading order).

When text is not in reading order within a PDF page it is often headers, footers, captions, callouts, block quotes....

There are, I believe, features in the modern PDF spec to allow for accessibility that would give you more structure to that raw text. I am not sure that this is a widely used feature when creating PDFs, though.

It's true that data is often written out in a logical order, but like you say, that's only because the program that created it was designed that way. I've definitely seen PDF files where tabular data is almost in a logical order but every now and then cells have been jumbled around.

But what I've definitely seen is documents where the characters are deliberately jumbled up and a custom font is used so that visually everything looks fine. I know this because there was one specific case where I wanted to extract about 5000 words in a vocabulary list and it was hard to decipher. They'd used several such fonts in the single document as well, so there wasn't a one-to-one mapping of the text encryption. They'd also put a watermark under the list, so you couldn't easily do OCR of the final screen image either.

To be sure, the content creator can run riot with the PDF spec and make it suck for everyone but a human reading the screen or printed page. Fortunately I would say 99% of PDFs are much better behaved than that.

Except it's a poorly formatted document, because it isn't formatted to fit screens of different widths, which is a huge deal (phones are a thing).

Also, pointing at the author's motivation doesn't solve another huge failure of the most basic digital workflow, copy and paste: the "except spaces" problems ruin it for everyone, not just professional data extractors.