Hacker News new | ask | show | jobs
by halomru 3402 days ago
PDF is a perfectly fine and rich digital format. It also allows you to do proper copy and paste, which is much saner than anything paper offers.

Sure, PDF is a light on context clues for automation and is targeted purely at humans. But formats targeted at both computers and humans consistently fail (XML with accompanying XSLT comes to mind), and/or only have terrible tools for creating files (easily parsable, pretty HTML).

Either there is very little real demand or we consistently fail at making alternatives viable.

4 comments

> It also allows you to do proper copy and paste, which is much saner than anything paper offers.

Technically it does that, yes. Rarely do I see people taking advantage of it, though; most of the times I tried to copy some text out of PDF, the result had to undergo a significant cleanup before becoming usable.

I've had similar experiences, as well as worse; one PDF I got from a bank about my student loans a year or so back had ostensibly only text content, but none of it was even able to be selected.
PDF is not very fine. Copy-paste from PDF very often results in complete rubbish, even when it is not deliberately prevented (which the format allows, and then you have to do OCR).
People purposefully disallowing copy-paste isn't a problem with PDF: in other formats they would have embedded a picture, at least with PDF you get the other advantages of proper text: infinite zoom and great compression. Sadly there's also a lot of PDFs that are little more than a picture collection that looks like text, but that's hardly the file format's fault.

It really is a problem with PDF that it's too easy to get a file where copy and paste yields a different result than what's displayed. But this varies widely with the software used for creating the file (e.g. latex ligatures never work in copy-paste)

When the PDF is a picture collection that looks like text that's when PDF is being used correctly, because that's when something was scanned out of paper and put on a paper-like format for computers, PDF.

When people write text and data and tables on the computer then put it on a paper-like format to share that's when the problem happens.

Have you ever actually tried to parse PDF with software? It's a sheer nightmare. PDF often gets produced from text processors that have very rich format information. PDF strips it all out and then you somehow have to recreate it.
PDF has no paragraphs, often not even words. No concept of font notes. It doesn't flow well with different screen sizes.

If you designed a real bad format on purpose, it would be hard to top PDF. Maybe Photoshop files are worse.

Have you seen the spec for .doc and .xls ?
I don't even want to know :)
Oh, you really do! The format for COM object based documents like XLS and DOC is actually a FAT filesystem: https://en.wikipedia.org/wiki/Compound_File_Binary_Format
HTML is a viable alternative. And it is something everyone can parse easily, better yet if the data is tagged with classes somehow.
That's something that should be pushed by the developer community, I think. Perhaps having an HTML client for people who nowadays use PDF writers and readers, with the option to tag data in some easily parseable format (if the data isn't already coming in a table).

This should output a single file and ideally it should have some way of ensuring the author it won't be modified unnoticed (that's one of the features common people use PDF for, today, they think it is something no one can modify) -- perhaps signing it with a key from Keybase would work in the mid term.

https://github.com/iffy/lhtml has something going in this direction.

epub is html-based, and their standards body recently got absorbed by the W3C. I think it would be a fine replacement for some of the uses that PDF gets (such as distributing research papers). Unfortunately I don't see it happening any time soon, PDF is so ubiquitous right now and there's very few tools that let you "save to epub". Chicken & egg.