| HN Mirror



	by 867-5309 2064 days ago
	If only PDFs could easily be converted back to a useful raw format. Parsing them is a bloody minefield, irregularly stuffed with proprietary metadata galore.

2 comments

unnouinceput 2063 days ago

PDF is a printing format, not an editing format, hence the trouble when you want to convert it back to an editable document. It's the same as going back from JPEG to BMP: you'll never get back your original pixels.


867-5309 2063 days ago

Yes, but unlike .bmp -> .jpg, compression is optional. You can display exactly the same content and layout in e.g. HTML, but there is no standard to govern or reverse this.

pdftotext -layout

Sometimes works well, depending on the structure and content of the PDF. Other times it's hopeless.

Certainly not a general solution. Indeed, there isn't one, because the design of PDF allows far too many things that can't be reliably deciphered back to the source data.

That's why Adobe is throwing all their ML at it, to try and come up with something that guesses near enough right more of the time.
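For anyone who wants to try the pdftotext -layout route from a script, here is a minimal sketch in Python. It assumes Poppler's pdftotext is installed and on PATH; the helper names build_pdftotext_cmd and extract_text are made up for illustration. As said above, the quality of the result depends entirely on the PDF, so treat this as a way to experiment, not a general solution.

```python
import shutil
import subprocess


def build_pdftotext_cmd(pdf_path, out_path="-", keep_layout=True):
    """Build the argument list for Poppler's pdftotext (hypothetical helper).

    An out_path of "-" asks pdftotext to write the text to stdout.
    """
    cmd = ["pdftotext"]
    if keep_layout:
        # -layout tries to keep the physical layout of the page
        cmd.append("-layout")
    cmd += [str(pdf_path), str(out_path)]
    return cmd


def extract_text(pdf_path):
    """Return the extracted text, or None if pdftotext is missing or fails."""
    if shutil.which("pdftotext") is None:
        return None  # assumption: the tool must be on PATH
    result = subprocess.run(
        build_pdftotext_cmd(pdf_path),
        capture_output=True,
        encoding="utf-8",
        errors="replace",
    )
    if result.returncode != 0:
        return None
    return result.stdout
```

Calling extract_text("report.pdf") (a placeholder file name) returns a string when the conversion succeeds. Whether columns, tables, and reading order survive is another matter.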


867-5309 2063 days ago

As with the hundreds of other converters, it will probably produce varying results.
