

by halomru 3402 days ago

PDF is a perfectly fine and rich digital format. It also allows you to do proper copy and paste, which is much saner than anything paper offers.

Sure, PDF is light on context clues for automation and is targeted purely at humans. But formats targeted at both computers and humans consistently fail (XML with accompanying XSLT comes to mind), and/or only have terrible tools for creating files (easily parsable, pretty HTML).

Either there is very little real demand or we consistently fail at making alternatives viable.

4 comments

TeMPOraL 3402 days ago

> It also allows you to do proper copy and paste, which is much saner than anything paper offers.

Technically it does that, yes. Rarely do I see people taking advantage of it, though; most of the time when I tried to copy some text out of a PDF, the result had to undergo significant cleanup before becoming usable.


saghm 3402 days ago

I've had similar experiences, as well as worse; one PDF I got from a bank about my student loans a year or so back ostensibly had only text content, but none of it could even be selected.


ptaipale 3402 days ago

PDF is not very fine. Copy-paste from PDF very often results in complete rubbish, even when it is not deliberately prevented (which the format allows, and then you have to do OCR).


halomru 3402 days ago

People purposefully disallowing copy-paste isn't a problem with PDF: in other formats they would have embedded a picture; at least with PDF you get the other advantages of proper text: infinite zoom and great compression. Sadly there are also a lot of PDFs that are little more than a picture collection that looks like text, but that's hardly the file format's fault.

It really is a problem with PDF that it's too easy to get a file where copy and paste yields a different result than what's displayed. But this varies widely with the software used for creating the file (e.g. LaTeX ligatures never work in copy-paste).
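To illustrate the ligature case: when a PDF viewer hands back the ligature code points themselves (for example U+FB01 for "fi"), Unicode compatibility normalization can expand them into plain letters. This is only a rough sketch; the function name `expand_ligatures` is made up for illustration, and it does nothing for files whose fonts lack a usable character mapping in the first place.

```python
import unicodedata


def expand_ligatures(text: str) -> str:
    """Expand compatibility characters such as the 'fi' ligature (U+FB01).

    NFKC also folds other compatibility forms (superscripts, full-width
    letters, ...), so it may change more than just ligatures.
    """
    return unicodedata.normalize("NFKC", text)


# Text as it might come out of a PDF: U+FB03 is the "ffi" ligature,
# U+FB01 is the "fi" ligature.
copied = "e\ufb03cient \ufb01le"
print(expand_ligatures(copied))
```

Because NFKC is a blunt tool, it is probably best applied only to text that is known to be plain prose.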


fiatjaf 3402 days ago

When the PDF is a picture collection that looks like text, that's when PDF is being used correctly, because that's when something was scanned from paper and put into a paper-like format for computers: PDF.

When people write text and data and tables on the computer and then put it into a paper-like format to share, that's when the problem happens.


maxxxxx 3402 days ago

Have you ever actually tried to parse PDF with software? It's a sheer nightmare. PDF often gets produced from word processors that have very rich formatting information. PDF strips it all out and then you somehow have to recreate it.


maxxxxx 3402 days ago

PDF has no paragraphs, often not even words. No concept of footnotes. It doesn't flow well with different screen sizes.

If you designed a real bad format on purpose, it would be hard to top PDF. Maybe Photoshop files are worse.


brianwawok 3401 days ago

Have you seen the spec for .doc and .xls ?


maxxxxx 3401 days ago

I don't even want to know :)


grkvlt 3400 days ago

Oh, you really do! The format for COM object based documents like XLS and DOC is essentially a FAT-like filesystem inside a single file: https://en.wikipedia.org/wiki/Compound_File_Binary_Format
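For anyone curious, a Compound File Binary container can be recognized by its 8-byte signature at the start of the file. The following is a minimal sketch that only checks that signature; `looks_like_cfb` is a hypothetical helper name, and it does not parse the sector tables or directory entries.

```python
# Signature found in the first 8 bytes of a Compound File Binary file
# (legacy .doc, .xls, .ppt and similar).
CFB_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def looks_like_cfb(data: bytes) -> bool:
    """Return True if the buffer starts with the CFB signature."""
    return data[:8] == CFB_MAGIC


# Usage with an in-memory buffer; for a real file, read its first 8 bytes.
print(looks_like_cfb(CFB_MAGIC + b"\x00" * 504))
print(looks_like_cfb(b"%PDF-1.7"))
```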


fiatjaf 3402 days ago

HTML is a viable alternative. And it is something everyone can parse easily, better yet if the data is tagged with classes somehow.
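As a rough sketch of what "tagged with classes" could buy you, the snippet below pulls out the text of every element carrying a given class, using only Python's standard library. The class names (`amount`, `due`), the sample markup, and the helper `extract_by_class` are assumptions made up for illustration, not an existing convention.

```python
from html.parser import HTMLParser


class ClassTextExtractor(HTMLParser):
    """Collect the text of elements that carry a given class."""

    def __init__(self, wanted_class: str):
        super().__init__()
        self.wanted_class = wanted_class
        self.depth = 0      # > 0 while inside a matching element
        self.values = []    # one string per matching element

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or self.wanted_class in classes:
            # Assumes no unclosed void tags (e.g. <br>) inside tagged elements.
            self.depth += 1
            if self.depth == 1:
                self.values.append("")

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.values[-1] += data


def extract_by_class(html: str, wanted_class: str) -> list:
    parser = ClassTextExtractor(wanted_class)
    parser.feed(html)
    return parser.values


statement = (
    '<table><tr><td class="amount">1200.50</td>'
    '<td class="due">2017-03-01</td></tr></table>'
)
print(extract_by_class(statement, "amount"))
```

Getting the same two values out of a typical PDF statement would mean guessing at text positions instead of reading a label.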


fiatjaf 3402 days ago

That's something that should be pushed by the developer community, I think. Perhaps having an HTML client for people who nowadays use PDF writers and readers, with the option to tag data in some easily parseable format (if the data isn't already coming in a table).

This should output a single file, and ideally it should have some way of assuring the author that it won't be modified unnoticed (that's one of the features common people use PDF for today; they think it is something no one can modify) -- perhaps signing it with a key from Keybase would work in the mid term.

https://github.com/iffy/lhtml has something going in this direction.
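The "not modified unnoticed" part can be approximated without any special tooling. The sketch below is not a signature and is not how lhtml or Keybase work; it swaps in a plain SHA-256 fingerprint, which only helps if the author publishes the fingerprint through a channel the reader already trusts. The function names are hypothetical.

```python
import hashlib
import hmac


def fingerprint(document: bytes) -> str:
    """Return the SHA-256 digest of the document as a hex string."""
    return hashlib.sha256(document).hexdigest()


def is_unmodified(document: bytes, published: str) -> bool:
    """Compare the document's digest with the one the author published."""
    return hmac.compare_digest(fingerprint(document), published)


original = b"<html><body><p>Balance: 1200.50</p></body></html>"
published = fingerprint(original)   # the author shares this separately

print(is_unmodified(original, published))
print(is_unmodified(original.replace(b"1200", b"9200"), published))
```

A real signature would additionally tie the fingerprint to the author's key, so that a tamperer could not simply publish a new digest.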


scrollaway 3402 days ago

EPUB is HTML-based, and its standards body recently got absorbed by the W3C. I think it would be a fine replacement for some of the uses that PDF gets (such as distributing research papers). Unfortunately I don't see it happening any time soon: PDF is so ubiquitous right now and there are very few tools that let you "save to epub". Chicken & egg.
