Hacker News
by scrollaway 3399 days ago
I find it absolutely ridiculous that we have to resort to these kinds of tools :/

We have digital formats, and we decided to standardize document distribution on the one that makes it as hard to extract data as if it were on physical paper.

5 comments

PDF is a perfectly fine and rich digital format. It also allows you to do proper copy and paste, which is much saner than anything paper offers.

Sure, PDF is light on context clues for automation and is targeted purely at humans. But formats targeted at both computers and humans consistently fail (XML with accompanying XSLT comes to mind), and/or only have terrible tools for creating files (easily parsable, pretty HTML).

Either there is very little real demand or we consistently fail at making alternatives viable.

> It also allows you to do proper copy and paste, which is much saner than anything paper offers.

Technically it does that, yes. Rarely do I see people taking advantage of it, though; most of the times I tried to copy text out of a PDF, the result needed significant cleanup before it was usable.

I've had similar experiences, as well as worse; one PDF I got from a bank about my student loans a year or so back had ostensibly only text content, but none of it could even be selected.
PDF is not very fine. Copy-paste from PDF very often results in complete rubbish, even when it is not deliberately prevented (which the format allows, and then you have to do OCR).
People purposefully disallowing copy-paste isn't a problem with PDF: in other formats they would have embedded a picture. At least with PDF you get the other advantages of proper text: infinite zoom and great compression. Sadly there are also a lot of PDFs that are little more than a picture collection that looks like text, but that's hardly the file format's fault.

It really is a problem with PDF that it's too easy to get a file where copy and paste yields a different result than what's displayed. But this varies widely with the software used for creating the file (e.g. LaTeX ligatures never work in copy-paste).
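
To make the ligature complaint concrete, here is a minimal sketch of one cleanup step, assuming the pasted text actually contains Unicode ligature code points such as U+FB01. The helper name `expand_ligatures` is made up for illustration. If the PDF's glyphs carry no Unicode mapping at all, this does not help.

```python
import unicodedata

def expand_ligatures(text: str) -> str:
    # NFKC compatibility normalization decomposes ligature code points
    # such as U+FB01 (fi) and U+FB02 (fl) into their plain letters.
    return unicodedata.normalize("NFKC", text)

# Hypothetical text as it might come out of a PDF viewer's clipboard.
pasted = "The \ufb01rst \ufb02oor of\ufb01ce"
print(expand_ligatures(pasted))
```

Note that NFKC also rewrites other compatibility characters (superscript digits, non-breaking spaces), so it is a blunt tool rather than a ligature-only fix.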

When the PDF is a picture collection that looks like text, that's when PDF is being used correctly, because that's when something was scanned from paper and put into a paper-like format for computers: PDF.

When people write text, data and tables on the computer and then put them into a paper-like format to share, that's when the problem happens.

Have you ever actually tried to parse PDF with software? It's a sheer nightmare. PDF often gets produced from word processors that have very rich formatting information. PDF strips it all out and then you somehow have to recreate it.
PDF has no paragraphs, often not even words. No concept of footnotes. It doesn't flow well with different screen sizes.

If you designed a real bad format on purpose, it would be hard to top PDF. Maybe Photoshop files are worse.

Have you seen the spec for .doc and .xls?
I don't even want to know :)
Oh, you really do! The format for COM-object-based documents like XLS and DOC is actually a FAT-like filesystem inside a single file: https://en.wikipedia.org/wiki/Compound_File_Binary_Format
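
For the curious, a rough sketch of what the start of such a file looks like, based on the published Compound File Binary layout: an 8-byte signature, then a header whose 16-bit field at byte offset 30 holds the sector size as a power of two. The function names are made up for illustration, and this only peeks at the header; it does not walk the FAT.

```python
import struct

# Signature that opens every Compound File Binary container.
CFB_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")

def looks_like_compound_file(header: bytes) -> bool:
    # Old-style .doc/.xls files start with this 8-byte signature.
    return header[:8] == CFB_MAGIC

def sector_size(header: bytes) -> int:
    # The "sector shift" is a little-endian uint16 at byte offset 30;
    # the sector size is 2 ** shift (9 gives 512-byte sectors).
    (shift,) = struct.unpack_from("<H", header, 30)
    return 1 << shift

def read_header(path: str) -> bytes:
    # The header occupies the first 512 bytes of the file.
    with open(path, "rb") as f:
        return f.read(512)
```

Usage would be something like `looks_like_compound_file(read_header("report.doc"))`, where `report.doc` is a placeholder path.
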
HTML is a viable alternative. And it is something everyone can parse easily, better yet if the data is tagged with classes somehow.
That's something that should be pushed by the developer community, I think. Perhaps having an HTML client for people who nowadays use PDF writers and readers, with the option to tag data in some easily parseable format (if the data isn't already coming in a table).
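
As a rough illustration of how little code class-tagged HTML needs, here is a sketch using only the Python standard library. The class name `score` and the sample markup are invented, and the parser assumes tagged elements contain plain text with no nested tags.

```python
from html.parser import HTMLParser

class TaggedDataParser(HTMLParser):
    """Collects the text of every element carrying a given class."""

    def __init__(self, wanted_class: str):
        super().__init__()
        self.wanted_class = wanted_class
        self.values = []
        self._buffer = None  # a list while inside a matching element

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.wanted_class in classes:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        # Assumption: no nested tags, so any end tag closes the element.
        if self._buffer is not None:
            self.values.append("".join(self._buffer).strip())
            self._buffer = None

def extract_tagged(html_text: str, wanted_class: str) -> list:
    parser = TaggedDataParser(wanted_class)
    parser.feed(html_text)
    return parser.values

sample = '<tr><td class="score">512</td><td>n/a</td><td class="score big">498</td></tr>'
print(extract_tagged(sample, "score"))
```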

This should output a single file, and ideally it should have some way of assuring the author that it won't be modified unnoticed (that's one of the features ordinary people use PDF for today; they think it is something no one can modify) -- perhaps signing it with a key from Keybase would work in the mid term.
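
A minimal sketch of the "modified unnoticed" check, using an HMAC from the Python standard library as a stand-in. This is a shared-secret scheme, not the public-key signature a Keybase key would give you, and the function names are made up for illustration.

```python
import hashlib
import hmac

def sign(document: bytes, key: bytes) -> str:
    # HMAC-SHA256 over the whole file; any change to the bytes changes the tag.
    return hmac.new(key, document, hashlib.sha256).hexdigest()

def is_unmodified(document: bytes, key: bytes, signature: str) -> bool:
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(sign(document, key), signature)

key = b"hypothetical shared secret"
original = b"<html><body>Invoice total: 100</body></html>"
tag = sign(original, key)

print(is_unmodified(original, key, tag))
print(is_unmodified(original.replace(b"100", b"900"), key, tag))
```

With a real signature scheme the reader would verify against the author's public key instead of holding the secret themselves.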

https://github.com/iffy/lhtml has something going in this direction.

epub is HTML-based, and their standards body recently got absorbed by the W3C. I think it would be a fine replacement for some of the uses that PDF gets (such as distributing research papers). Unfortunately I don't see it happening any time soon: PDF is so ubiquitous right now and there are very few tools that let you "save to epub". Chicken & egg.
That's Adobe. Look at their other formats, and PDF seems to be one of their better ones. Compare to SWF, PSD, AI and so on.

PDF is the successor of PostScript. PostScript is a stack-based programming language where anything can happen, while PDF enforces some document structure and metadata structure on top of it, so you can e.g. at least determine where page breaks are, without having to interpret ("run the code of") the whole document.
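
To show what "stack-based" means here, a toy evaluator for a handful of PostScript-style arithmetic operators. It is nowhere near real PostScript (no drawing, no dictionaries, no procedures); it only illustrates that you learn the result by running the program.

```python
def run(program: str) -> list:
    """Evaluate a whitespace-separated, PostScript-like token stream."""
    stack = []
    for token in program.split():
        if token == "add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif token == "sub":
            b, a = stack.pop(), stack.pop()
            stack.append(a - b)
        elif token == "mul":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif token == "dup":
            stack.append(stack[-1])
        elif token == "exch":
            stack[-1], stack[-2] = stack[-2], stack[-1]
        elif token == "pop":
            stack.pop()
        else:
            # Anything that is not an operator is treated as a number.
            stack.append(float(token))
    return stack

print(run("3 4 add 2 mul"))
```

In full PostScript the same mechanism can decide how many pages a document has, which is why you cannot find page boundaries without interpreting it.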

Still, PDF is simpler than PostScript in the same sense that XML is a simplification of SGML. Jumping from PDF to a well-designed format would be like jumping from XML to JSON or S-Expr.

It's because PDFs have no concept of lines or paragraphs. It's just characters at an x,y coordinate which happen to line up. So figuring out what's a line or a column is a pain in the ass.

That's most likely why copying and pasting sucks too.
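
A rough sketch of the guesswork involved, assuming you already have each character with its x,y position (the `Glyph` type and the tolerance value are made up for illustration, not taken from any PDF library). Lines are rebuilt by clustering characters whose baselines are close enough.

```python
from typing import NamedTuple

class Glyph(NamedTuple):
    char: str
    x: float
    y: float

def group_into_lines(glyphs, y_tolerance: float = 2.0) -> list:
    """Cluster glyphs into text lines by similar y, top line first."""
    lines = []  # each entry: (y of the line's first glyph, [glyphs])
    # PDF's default y axis grows upward, so sort by descending y.
    for g in sorted(glyphs, key=lambda g: (-g.y, g.x)):
        if lines and abs(lines[-1][0] - g.y) <= y_tolerance:
            lines[-1][1].append(g)
        else:
            lines.append((g.y, [g]))
    # Within a line, reading order is left to right.
    return ["".join(g.char for g in sorted(row, key=lambda g: g.x))
            for _, row in lines]

page = [Glyph("o", 10, 680), Glyph("i", 16, 700.5),
        Glyph("H", 10, 700), Glyph("k", 16, 680)]
print(group_into_lines(page))
```

Word spacing, columns and tables each need another heuristic on top of this (gaps in x, for instance), which is where it gets painful.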

Yes, and even more so when you want to send a PDF-based document to a Kindle.
I had to use Tabula to extract a decade of SAT scores from PDFs for each state/year. It was a nightmare, but I managed it. More recently, I was hoping to do something similar with decennial census data, but it was just too much. Far, far too many groups publish data to PDF, which is about as bad as if they just deleted it straight-out. It's very upsetting.
PDF is fucked up beyond all doubt. But there seems to be no better (even if unpopular) alternative.

How do you imagine a better alternative to PDF? On the one hand, we have text-based formats. They are not a serialization of the exact rendering. On the other hand, we have PS, which is probably too complex to be manipulated as text once rendered. PDF and DjVu kinda do both, even if quite imperfectly.

So how do we construct a file format which can render a symbol (not necessarily a Unicode one) anywhere, pixel-perfect, but still has a concept of words, paragraphs, preferably tables and such?

epub is the way to go I think. PDF is an overengineered abomination. It nicely serves the purpose of "there is only one and exactly one way to render this", but then again, just about so does an image.

I also think that, with more love, epub could get there too. It's not an easy problem, but if we can crack SHA-1 I'm sure we can crack this one too :)

I don't see how epub can be pixel-perfect. It's almost as much a markup format as fb2. Clearly, more explanation of how it should be done is in order.
Pixel perfection is not necessary for 99.999% of the cases PDF is used in.
That's just a ridiculous statement.
Microsoft XPS?
This is interesting. I never considered this one. How is it inferior to PDF, so that it is so much less widely spread?
It's not as versatile (no forms, for example), but layout- and prepress-wise it seems to be as good as PDF (with the benefit that it retains the structure).