by PaulHoule 1250 days ago
Back in the day company A would send a paper document to company B and naturally somebody would have to retype it. PDF is great for that legacy workflow or anything where you need print output or screen output that exactly resembles print output.

PDF has facilities for tagging documents so that they can be reflowed like HTML and viewed on different-sized screens. It is a boon for accessibility, but it is hard to frame the discussion around a better experience for everyone, particularly automated tools, rather than around accessibility alone. (In politics there is an analogous pattern: we "can't have good things" because policies that are good for everyone get framed as policies that benefit a racial or other group perceived as a "special interest".)

I spoke w/ Larry Masinter at Adobe and he told me Adobe would like people who want structured data in their PDF documents to simply attach files to the PDF. A scientific paper could contain a CSV file of the data, for instance, or a business document could contain a JSON or XML document.

Note that "structured" is not a panacea because the structure might not be the same in the two organizations. For exchange of structured data to take place the organizations have to agree on some ontology, something that happens in some industries some of the time, but it isn't free, and when it is not in place people still have an excuse to continue using paper processes or processes that emulate paper processes.


Thanks for responding. I'm curious why PDF doesn't have any metadata attached to it that can easily be parsed out by machines. Sigh
Thanks for sharing! Why do you think XMP isn't widely adopted yet?
There has been a lot of politics. It's yet another case study for "why we can't have nice things."

When XMP first came out, Adobe tools would look at all the metadata in, say, an image file (such as EXIF) and re-express it in XMP format. I liked that a lot because I could read that XMP packet with my RDF tools and have complete access to all the metadata with very simple software.
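
To give a sense of how little software that takes, here is a minimal sketch of pulling an XMP packet out of a file and reading its properties with nothing but the Python standard library. It assumes the packet is stored uncompressed so that a byte scan can find it, and that properties are written as child elements rather than as attributes on rdf:Description. The SAMPLE bytes and the function names are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

NS = {
    "x": "adobe:ns:meta/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical stand-in for a real file: an XMP packet surrounded by binary data.
SAMPLE = (
    b"\x89binary-data-before"
    b'<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>'
    b'<x:xmpmeta xmlns:x="adobe:ns:meta/">'
    b'<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">'
    b'<rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">'
    b"<dc:format>application/pdf</dc:format>"
    b"<dc:creator><rdf:Seq>"
    b"<rdf:li>First Author</rdf:li><rdf:li>Second Author</rdf:li>"
    b"</rdf:Seq></dc:creator>"
    b"</rdf:Description></rdf:RDF></x:xmpmeta>"
    b'<?xpacket end="w"?>'
    b"binary-data-after"
)


def find_xmp_packet(data):
    """Return the first <x:xmpmeta> element found in raw bytes, or None."""
    match = re.search(rb"<x:xmpmeta\b.*?</x:xmpmeta>", data, re.DOTALL)
    return match.group(0) if match else None


def read_xmp(data):
    """Map each property (in {namespace}name form) to a string or a list."""
    packet = find_xmp_packet(data)
    if packet is None:
        return {}
    root = ET.fromstring(packet)
    props = {}
    for desc in root.iterfind(".//rdf:Description", NS):
        for child in desc:
            items = child.findall(".//rdf:li", NS)
            # Array-valued properties (rdf:Seq, rdf:Bag, rdf:Alt) become lists.
            props[child.tag] = [li.text for li in items] if items else child.text
    return props


if __name__ == "__main__":
    for name, value in read_xmp(SAMPLE).items():
        print(name, value)
```

A real RDF toolkit would hand you proper triples instead of a dict, but the point stands: the packet is ordinary RDF/XML that you can find and parse without understanding the container format around it.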

At some point other people in the industry accused Adobe of undermining other metadata standards and Adobe was pressured to only use XMP for data that could not be expressed with EXIF and other formats. This takes away complete and easy-to-work-with metadata unless I write my own tools that can convert the EXIF metadata to XMP and merge it with the XMP which might be in the document.
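
The kind of tool I mean could be sketched roughly like this. It assumes the EXIF tags have already been decoded into a dict by some other library, and it assumes a merge policy (existing XMP values win) that is my own choice, not anything a standard mandates. The mapping table is deliberately tiny and the function names are hypothetical.

```python
from datetime import datetime

# A few EXIF tag names and the XMP properties that correspond to them.
# A real converter would cover far more tags than this.
EXIF_TO_XMP = {
    "Make": "tiff:Make",
    "Model": "tiff:Model",
    "DateTimeOriginal": "exif:DateTimeOriginal",
}


def exif_value_to_xmp(tag, value):
    """Convert an EXIF value to the form XMP expects for that property."""
    if tag == "DateTimeOriginal":
        # EXIF writes dates as "YYYY:MM:DD HH:MM:SS"; XMP uses ISO 8601.
        return datetime.strptime(value, "%Y:%m:%d %H:%M:%S").isoformat()
    return value


def merge_exif_into_xmp(exif, xmp):
    """Return a new dict of XMP properties with mapped EXIF values added.

    Assumed policy: a property already present in the XMP is left alone.
    """
    merged = dict(xmp)
    for tag, value in exif.items():
        prop = EXIF_TO_XMP.get(tag)
        if prop is not None and prop not in merged:
            merged[prop] = exif_value_to_xmp(tag, value)
    return merged


if __name__ == "__main__":
    exif = {"Make": "ExampleCam", "DateTimeOriginal": "2008:05:30 15:56:01"}
    xmp = {"dc:title": "Harbour at dusk"}
    print(merge_exif_into_xmp(exif, xmp))
```

None of this is hard, but it is work every consumer now has to redo, where before the XMP packet alone was enough.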

The semantic web community also has some blame here, as it never embraced XMP. If Adobe had had more industry support, it might not have nerfed XMP. I very much like how XMP adopted solutions to problems like keeping track of the order of authors, which communities like the one behind Dublin Core haven't had the moral fortitude to address... That keeps Dublin Core in the category of "metadata for an elementary school library" as opposed to the world-beating solution that XMP and DC could have been.
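
For anyone who hasn't looked at it, the author-order fix is simple: XMP writes dc:creator as an ordered rdf:Seq, while a property like dc:subject, where order doesn't matter, is an unordered rdf:Bag. A minimal sketch of producing that structure with the standard library follows. The helper names are hypothetical, and this emits only the rdf:RDF fragment, not a complete packet with its xpacket wrapper.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("rdf", RDF)
ET.register_namespace("dc", DC)


def array_property(parent, ns, name, values, ordered):
    """Add an array-valued property: rdf:Seq if order matters, else rdf:Bag."""
    prop = ET.SubElement(parent, f"{{{ns}}}{name}")
    kind = "Seq" if ordered else "Bag"
    container = ET.SubElement(prop, f"{{{RDF}}}{kind}")
    for value in values:
        ET.SubElement(container, f"{{{RDF}}}li").text = value
    return prop


def build_description(authors, keywords):
    """Build an rdf:RDF fragment with ordered authors and unordered keywords."""
    rdf = ET.Element(f"{{{RDF}}}RDF")
    desc = ET.SubElement(rdf, f"{{{RDF}}}Description", {f"{{{RDF}}}about": ""})
    array_property(desc, DC, "creator", authors, ordered=True)
    array_property(desc, DC, "subject", keywords, ordered=False)
    return rdf


if __name__ == "__main__":
    fragment = build_description(
        ["First Author", "Second Author"], ["pdf", "metadata"]
    )
    print(ET.tostring(fragment, encoding="unicode"))
```

The container type carries the meaning: a consumer that sees rdf:Seq knows the first rdf:li is the first author and must not reorder the list.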

You might like this thesis:

http://www.bloechle.ch/jean-luc/pub/Bloechle_Thesis.pdf

I made a HN post on this here: https://news.ycombinator.com/item?id=33674525

Unfortunately, I contacted the author via YouTube, and the work is proprietary, owned by the business he either created or sold it to.

Thanks for sharing -- will dive deeper. This has been keeping me up at night recently...