by crispyambulance 2338 days ago
It was really shocking when I learned that the way pdf works is as you describe, literally fragments of text with positions and essentially no semantics.

I think a lot of folks find this out as I did, when they run into a project where they need to extract info from pdf documents. Without knowing anything about pdf, one can easily assume it will be simple: "can't we just extract some semantic structures like headings, tables, etc.?"... but nooo, it don't work that way!

Discovering the true nature of pdf is a major WTF moment because we're so conditioned to expect documents to have a semantic structure. It's hard to understand how a standard can take the exact opposite approach and be so successful.


It's so successful precisely because it doesn't have semantics. It is a print format with one goal: show the output as desired. Semantics only confuse and limit this goal.

Imagine how bogged down and limited vector graphics would be if every element had to have semantic meaning? "This line connects the <body> of the <car> to the 13th <spoke> on the <wheel>".

ISTM semantics could have been added as a supplement to PDF à la microformats for HTML, which wouldn't have hurt anything. It's easy for processors to just skip some well-defined tokens. Of course, few producers of PDFs for public consumption would have incentives to do that, so it probably never would have taken off...
Just look at how popular it is to embed/attach arbitrary files to pdf documents.

(it's well supported by Adobe tools...)

Yeah, so by eliminating semantic concerns, pdf has achieved transcendence as a document format.

But by its nature text is intrinsically semantic. I am just surprised that a document format utterly free of semantics has lasted so long. Perhaps because we (as people and organizations) can't agree on the structure of documents?

Another way to see this, perhaps, is as the failure of the promise of xml and the ecosystem around it? In the late '90s many of us thought that all documents would be xml for content and styling would be through xslt or even its big sister, xsl. Well, THAT went nowhere despite all the W3C meetings and papers.

It's interesting you brought up graphics as an analogy. It's true that you can have graphics which are literally just lines and that's adequate for many needs. However, modern CAD drawing systems increasingly use notions of 2D/3D objects and a disciplined series of transformations. They call it "parametric modeling", and it's where every drawing consists of a series of transformations that can be represented in a timeline. I suspect modern parametric model CAD can very much be semantic.
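A toy sketch of the timeline idea, assuming a drawing is just a list of 2D points and each step is a named function driven by shared parameters. The names (`Feature`, `TIMELINE`, `rebuild`) and the parameter keys are invented for illustration and don't correspond to any real CAD package.

```python
from dataclasses import dataclass
from typing import Callable

Point = tuple[float, float]


@dataclass
class Feature:
    name: str  # the semantic label: what this step *means*
    apply: Callable[[list[Point], dict], list[Point]]


def rectangle(_points, p):
    # The first feature ignores its input and creates the base shape.
    w, h = p["width"], p["height"]
    return [(0.0, 0.0), (w, 0.0), (w, h), (0.0, h)]


def scale(points, p):
    return [(x * p["factor"], y * p["factor"]) for x, y in points]


def translate(points, p):
    return [(x + p["dx"], y + p["dy"]) for x, y in points]


# The "timeline": an ordered history of named operations.
TIMELINE = [
    Feature("base rectangle", rectangle),
    Feature("scale", scale),
    Feature("move", translate),
]


def rebuild(params):
    """Replay the whole timeline from scratch with the given parameters."""
    points: list[Point] = []
    for feature in TIMELINE:
        points = feature.apply(points, params)
    return points


params = {"width": 2.0, "height": 1.0, "factor": 2.0, "dx": 1.0, "dy": 0.0}
print(rebuild(params))
# Change one parameter and replay; every later step follows along.
print(rebuild({**params, "width": 3.0}))
```

The contrast with "just lines" is that the final points are derived output, while the stored document is the list of named steps and their parameters.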

Books have about the same amount of semantic information as pdf. It's probably just habit.

I think more than lack of agreement, it's just that there aren't really universal document structures. There are relatively useful chunks like paragraphs that are more or less universal (at least for a given language), but those don't need much structure to be clear.

It isn’t in the interests of word processors to round-trip through pdf. If you look at the PDFs the mainstream word processors generate, you see some of them actively trying to stop text extraction. It’s like an obfuscation arms race. They include white-on-white text, and jump all over the page positioning text so no whole words occur in the source etc. Sad but true.