by maxxxxx 3266 days ago
I still don't understand how PDF could become one of the standards for publishing documents. Well-structured content gets converted into PDF, which loses most of that structure. Then a lot of work goes into guessing that structure from the PDF and converting it back to a better file format. It just shows that successful solutions don't have to be technically good.
3 comments

The keyword is "publishing" --- as in, producing human-readable physical copies, not electronic ones. It just so happens that the format was relatively suitable for the latter too (because it actually looks like a printed document rendered on the screen --- unlike HTML or other formats around at the time), which is why that use-case became popular. PDF is basically a descendant of PostScript, which was designed to control printers.

(Its PostScript origins may also explain the bizarre mix of text and binary that constitutes the file format. For example, page contents are in a relatively free-form PostScript-ish RPN-like textual language, but are found in "content streams" which may be compressed or encoded into a binary format. Data "object" structures include things like '<<'-delimited dictionaries, '[' arrays ']', textual "/Names", and even provisions for comments(!?).

Then there are things like the cross-reference table of all objects in the file, which is an array of fixed-width textual numbers representing file offsets, e.g. "0000001056 00000 n" refers to something 1056 bytes from the start of the file. Reactions of WTF!? from those working with the format for the first time are not uncommon.)
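
To make the cross-reference entry format concrete, here is a minimal Python sketch that parses one such line. The names `XrefEntry` and `parse_xref_entry` are made up for illustration, and the sketch assumes the classic (non-stream) table layout: a 10-digit number, a 5-digit generation number, and an 'n' (in use) or 'f' (free) flag. It is not a full PDF parser.

```python
import re
from typing import NamedTuple


class XrefEntry(NamedTuple):
    """One row of a classic PDF cross-reference table (hypothetical helper type)."""
    offset: int      # byte offset from the start of the file (for 'n' entries)
    generation: int  # generation number of the object
    in_use: bool     # True for 'n' (in use), False for 'f' (free)


# 10 digits, space, 5 digits, space, then 'n' or 'f'.
_ENTRY_RE = re.compile(r"^(\d{10}) (\d{5}) ([nf])$")


def parse_xref_entry(line: str) -> XrefEntry:
    """Parse a single fixed-width xref line such as '0000001056 00000 n'."""
    # Drop the trailing end-of-line bytes that pad each entry in the file.
    match = _ENTRY_RE.match(line.rstrip())
    if match is None:
        raise ValueError(f"not a cross-reference entry: {line!r}")
    number, generation, flag = match.groups()
    # Note: for 'f' entries the first field is the number of the next free
    # object rather than a file offset; this sketch stores it in `offset` anyway.
    return XrefEntry(int(number), int(generation), flag == "n")


entry = parse_xref_entry("0000001056 00000 n")
print(entry.offset, entry.generation, entry.in_use)
```

With the example from the comment above, the entry's offset comes out as 1056, matching "1056 bytes from the start of the file".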

PDF has a feature called Tagged PDF, which allows the document to be annotated with a semantic structure. Almost nobody bothers to generate such PDFs, but the support is there!
Dutch law requires that official documents be published as PDF/A-1a, which is a subset of PDF 1.4 that can be archived and must be tagged.
Sadly, I think that often the publishers actually want it that way, i.e. they do not want the data to be easily parsable...
I think it's more that they want consistency in rendering across devices and media.
For legal documents (where PDF was used first, as far as I know) this may make sense, but for manuals and other documents it makes no sense at all.