by 867-5309 2064 days ago
If only PDFs could easily be converted back to a useful raw format. Parsing them is a bloody minefield, irregularly stuffed with proprietary metadata.

PDF is a printing format, not an editing format - hence the trouble when you want to convert it back to an editable document. It's the same as going back from JPEG to BMP: you'll never get back your original pixels.
Yes, but unlike BMP -> JPEG, compression is optional. You can display exactly the same content and layout in e.g. HTML, but there is no standard to govern that conversion or to reverse it.
pdftotext -layout
Sometimes works well, depending on the structure and content of the PDF. Other times it's hopeless.
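
For anyone who wants to try it from a script, here's a rough sketch of wrapping that command in Python. It assumes the pdftotext binary (from Poppler or Xpdf) is on your PATH. The helper names build_pdftotext_cmd and pdf_to_text are made up for this example, and writing the output next to the input as a .txt file is just a choice made here.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdftotext_cmd(pdf_path, txt_path, layout=True):
    """Build the argument list for pdftotext (hypothetical helper)."""
    cmd = ["pdftotext"]
    if layout:
        # -layout asks pdftotext to keep the physical page layout as best it can
        cmd.append("-layout")
    cmd += [str(pdf_path), str(txt_path)]
    return cmd


def pdf_to_text(pdf_path, layout=True):
    """Run pdftotext and return the extracted text (hypothetical helper)."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    # Assumption for this sketch: write the output next to the input file
    txt_path = Path(pdf_path).with_suffix(".txt")
    subprocess.run(build_pdftotext_cmd(pdf_path, txt_path, layout), check=True)
    # Extracted text is not guaranteed to be clean, so decode leniently
    return txt_path.read_text(encoding="utf-8", errors="replace")
```

Calling pdf_to_text("report.pdf") would give you the text as a string. Whether that text is usable still depends entirely on how the PDF was produced.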

Certainly not a general solution. Indeed, there isn't one, because the design of PDF allows far too many things that can't be reliably deciphered back to the source data.

That's why Adobe is throwing all their ML at it, to try and come up with something that guesses near enough right more of the time.

As with the hundreds of other converters, it will probably produce varying results.