by nightcracker 2337 days ago
It's so successful precisely because it doesn't have semantics. It is a print format with one goal: show the output as desired. Semantics only confuse and limit this goal.

Imagine how bogged down and limited vector graphics would be if every element had to have semantic meaning. "This line connects the <body> of the <car> to the 13th <spoke> on the <wheel>".


ISTM semantics could have been added as a supplement to PDF à la microformats for HTML, which wouldn't have hurt anything. It's easy for processors to just skip some well-defined tokens. Of course, few producers of PDFs for public consumption would have incentives to do that, so it probably never would have taken off...
Just look at how popular it is to embed/attach arbitrary files to pdf documents.

(it's well supported by Adobe tools...)

Yeah, so by eliminating semantic concerns, pdf has achieved transcendence as a document format.

But by its nature text is intrinsically semantic. I am just surprised that a document format utterly free of semantics has lasted so long. Perhaps because we (as people and organizations) can't agree on the structure of documents?

Another way to see this, perhaps, is as the failure of the promise of XML and the ecosystem around it? In the late '90s many of us thought that all documents would be XML for content, with styling done through XSLT or even its big sister, XSL. Well, THAT went nowhere despite all the W3C meetings and papers.

It's interesting you brought up graphics as an analogy. It's true that you can have graphics which are literally just lines, and that's adequate for many needs. However, modern CAD drawing systems increasingly use notions of 2D/3D objects and a disciplined series of transformations. They call it "parametric modeling": every drawing consists of a series of transformations that can be represented in a timeline. I suspect modern parametric CAD models can very much be semantic.
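To make the "series of transformations in a timeline" idea concrete, here is a minimal sketch in Python. It is not how any particular CAD package works; the names `Feature`, `rectangle`, `scale`, `translate` and `rebuild` are all made up for illustration. The point is only that the model is stored as a base parameter set plus an ordered list of named operations, and the geometry is regenerated by replaying that list.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Point = Tuple[float, float]


@dataclass
class Feature:
    """One named step in the timeline (hypothetical structure)."""
    name: str
    apply: Callable[[List[Point]], List[Point]]


def rectangle(width: float, height: float) -> List[Point]:
    # Base sketch: the corner points of an axis-aligned rectangle at the origin.
    return [(0.0, 0.0), (width, 0.0), (width, height), (0.0, height)]


def scale(factor: float) -> Feature:
    return Feature(f"scale({factor})",
                   lambda pts: [(x * factor, y * factor) for x, y in pts])


def translate(dx: float, dy: float) -> Feature:
    return Feature(f"translate({dx}, {dy})",
                   lambda pts: [(x + dx, y + dy) for x, y in pts])


def rebuild(base: List[Point], timeline: List[Feature]) -> List[Point]:
    # Replay every step in order; the final points are derived, never stored.
    pts = list(base)
    for feature in timeline:
        pts = feature.apply(pts)
    return pts


timeline = [scale(2.0), translate(1.0, 1.0)]
shape = rebuild(rectangle(2.0, 1.0), timeline)

# Changing one parameter and replaying the same timeline yields a new shape.
wider = rebuild(rectangle(3.0, 1.0), timeline)
```

The timeline keeps the names of the operations, so the file records what was done and in what order rather than only the resulting lines. That is the sense in which such a model carries more meaning than a flat list of strokes.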

Books have about the same amount of semantic information as pdf. It's probably just habit.

I think more than lack of agreement, it's just that there aren't really universal document structures. There's relatively useful chunks like paragraphs that are more or less universal (at least for a given language), but those don't need much structure to be clear.

It isn’t in the interests of word processors to round-trip through PDF. If you look at the PDFs the mainstream word processors generate, you see some of them actively trying to stop text extraction. It’s like an obfuscation arms race. They include white-on-white text, and jump all over the page positioning text so that no whole words occur in the source, and so on. Sad but true.
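A toy illustration of why that kind of scrambling defeats naive extraction, and how an extractor might partly undo it. This is a sketch under assumptions, not a real PDF parser: the `Fragment` record and both functions are hypothetical, fragments are assumed to be already decoded into text with page coordinates, and y is assumed to grow upward as in PDF's default coordinate system.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Fragment:
    """A piece of positioned text (hypothetical, for illustration only)."""
    x: float
    y: float
    text: str
    color: str = "black"


def source_order(frags: List[Fragment]) -> str:
    # What you get by reading the fragments in the order they were written.
    return "".join(f.text for f in frags)


def reading_order(frags: List[Fragment], background: str = "white") -> str:
    # Drop text painted in the background color, then sort top to bottom
    # (larger y is higher on the page) and left to right.
    visible = [f for f in frags if f.color != background]
    visible.sort(key=lambda f: (-f.y, f.x))
    return "".join(f.text for f in visible)


fragments = [
    Fragment(30, 700, "lo"),
    Fragment(10, 700, "he"),
    Fragment(25, 700, "zz", color="white"),  # invisible decoy text
    Fragment(20, 700, "l"),
]

scrambled = source_order(fragments)   # "lohezzl"
recovered = reading_order(fragments)  # "hello"
```

Real documents are messier (fonts with custom encodings, rotated text, multiple columns), so sorting by position is only a first approximation of what an extractor has to do.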