Hacker News
by unnouinceput 2064 days ago
This makes a somewhat misleading argument: that PDF is the "basis" of the document world. PDF is not the basis.

Lemme explain: for each format there is a basis and there is the most used format. For sound that's .WAV / .MP3; for pictures that's .BMP / .JPEG (or .PNG if you're a purist).

And for documents that's .RTF / .PDF. You see, PDF is not the absolute basis; it's just the most convenient trade-off between usability and fidelity. Nobody except snobs wants pure .WAV files for their favorite songs; everybody uses .MP3 instead. If you want the absolute purest form of a document, you use .RTF.

My 2 cents.

3 comments

The analogy is not particularly apt.

For one thing, PDF can encode information that RTF cannot.

There are also lots of approaches to document layout (the underlying descriptions of what should appear, not just different styles).

Pedantically, the analogy works somewhat better for PostScript than RTF, but not really, except maybe the BMP -> PNG part.

If only PDFs could easily be converted back to a useful raw format. Parsing them is a bloody minefield, irregularly stuffed with proprietary metadata galore.
PDF is a printing format, not an editing format - hence the trouble when you want to convert it back to an editable document. It's the same as going back from .JPEG to .BMP, you'll never get back your original pixels.
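To make the "you'll never get back your original pixels" point concrete, here is a toy sketch in Python. It is not the real JPEG algorithm; coarse quantization just stands in for JPEG's lossy step, and the function names and the step size of 16 are made up for illustration.

```python
# Toy illustration only: this is NOT the real JPEG algorithm.
# Coarse quantization stands in for the lossy step of a codec.

def lossy_encode(pixels, step=16):
    """Keep only which `step`-wide bucket each 0-255 value falls into."""
    return [p // step for p in pixels]


def decode(codes, step=16):
    """Best-effort reconstruction: use the midpoint of each bucket."""
    return [c * step + step // 2 for c in codes]


original = [0, 7, 15, 100, 101, 255]
restored = decode(lossy_encode(original))

print(original)
print(restored)
# Distinct inputs such as 100 and 101 land in the same bucket,
# so no decoder can tell them apart afterwards.
```

Once two different values collapse into the same code, the information that distinguished them is gone, which is the same reason a round trip through a lossy format cannot restore the source.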
Yes, but unlike .bmp -> .jpg, compression is optional. You can display exactly the same content and layout in, e.g., HTML, but there is no standard to govern or reverse this.
pdftotext -layout
Sometimes works well, depending on the structure and content of the PDF. Other times it's hopeless.
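For anyone who wants to script the `pdftotext -layout` suggestion, a minimal sketch in Python follows. It assumes the `pdftotext` tool from poppler-utils is installed and on the PATH; the helper names and `input.pdf` are hypothetical, and the caveat above still applies: the output quality depends entirely on the PDF.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, txt_path="-"):
    """Build the argument list for `pdftotext -layout`.

    A txt_path of "-" asks pdftotext to write the text to stdout.
    """
    return ["pdftotext", "-layout", pdf_path, txt_path]


def extract_text(pdf_path):
    """Return the extracted text, or None if pdftotext is not installed."""
    if shutil.which("pdftotext") is None:
        return None
    result = subprocess.run(
        build_pdftotext_command(pdf_path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout


# Hypothetical usage:
#   text = extract_text("input.pdf")
#   if text is not None:
#       print(text)
```

The `-layout` flag asks pdftotext to keep the physical layout of the page as best it can, which helps with columns and tables but does not recover the document's logical structure.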

Certainly not a general solution. Indeed, there isn't one, because the design of PDF allows far too many things that can't be reliably deciphered back to the source data.

That's why Adobe is throwing all their ML at it, to try and come up with something that guesses near enough right more of the time.

As with the hundreds of other converters, it will probably produce varying results.
Even a snob wouldn't want a WAV file; FLAC is lossless.
Sometimes it's not about preferences but about what's most widely supported. For example, my Octatrack only supports WAV/AIFF files, so that's what I'm stuck with.