by zrail 4951 days ago
Not really. The problem is that PDF is basically a destination format. Converting a document to PDF strips out all of its semantics, leaving you with plain text, fonts, and boxes. The latest versions of the official Adobe Acrobat Reader can convert PDF to Doc, but I have no idea what the quality is like.
2 comments

Every time I have used Acrobat to convert PDF to Word, the only usable parts have been the tables. The rest is generally garbage.

Fortunately, the tables were the only parts I wanted! I needed to get them from the PDF into text (csv) form. So, from Word, I copied the tables, pasted them into Excel, and saved that as csv. Easy as 1-2-3-4-5!
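For what it's worth, the Excel hop can probably be skipped. A minimal sketch, assuming the table copied out of Word lands as plain text with one row per line and tabs between cells (the function name tsv_to_csv is made up for illustration):

```python
import csv
import io


def tsv_to_csv(tsv_text):
    """Convert tab-separated table text into CSV text."""
    # Assumption: one table row per line, cells separated by tabs.
    # QUOTE_NONE makes the reader treat quote characters as ordinary text.
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        if row:  # skip blank lines between rows
            writer.writerow(row)
    return out.getvalue()


if __name__ == "__main__":
    sample = "Item\tPrice\nWidget, large\t4.50\n"
    print(tsv_to_csv(sample), end="")
```

The csv writer quotes any cell that contains a comma, so values like "Widget, large" stay in one column.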

It's probably possible to do, but nobody's needed one badly enough to do it.
There is actually an Apache project that can extract the text from a PDF. It does a passable job but, like I said, all of the formatting is gone.

http://pdfbox.apache.org/userguide/text_extraction.html

There is a very good PDF-to-HTML converter at [0], so it's a two-step process.

[0] https://github.com/coolwanglu/pdf2htmlEX
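
A rough sketch of scripting the first step (PDF to HTML) from Python, assuming the pdf2htmlEX binary is installed and on the PATH. The helper names build_command and pdf_to_html and the default output directory are made up for illustration, and the second step (HTML to whatever format you actually want) is not covered here:

```python
import shutil
import subprocess
from pathlib import Path


def build_command(pdf_path, out_dir):
    """Build the pdf2htmlEX command line for a single PDF."""
    # Usage is: pdf2htmlEX [options] <input.pdf> [<output.html>]
    # --dest-dir sets the directory the HTML file is written to.
    html_name = Path(pdf_path).stem + ".html"
    return ["pdf2htmlEX", "--dest-dir", out_dir, pdf_path, html_name]


def pdf_to_html(pdf_path, out_dir="html_out"):
    """Run pdf2htmlEX and return the path of the generated HTML file."""
    if shutil.which("pdf2htmlEX") is None:
        raise FileNotFoundError("pdf2htmlEX was not found on PATH")
    cmd = build_command(pdf_path, out_dir)
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return Path(out_dir) / cmd[-1]
```

Something like pdf_to_html("report.pdf") should then leave the converted page at html_out/report.html.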