| HN Mirror



	by zrail 4951 days ago
	Not really. The problem is that PDF is basically a destination format. Converting a document to PDF strips out all of its semantics, leaving you with plain text, fonts, and boxes. The latest versions of the official Adobe Acrobat Reader are able to convert PDF to Doc, but I have no idea what the quality is like.
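
To make the "plain text, fonts, and boxes" point concrete, here is a rough sketch. It assumes a page is nothing more than a list of positioned text runs; the `fragments` data and the `rebuild_lines` helper are hypothetical and not tied to any real PDF library. Any converter has to guess structure such as lines, paragraphs, or table cells from coordinates alone.

```python
from collections import defaultdict

# Hypothetical positioned text runs, roughly what a PDF page holds:
# (x, y, font, text). No paragraphs, headings, or table cells are recorded.
fragments = [
    (72, 700, "Helvetica-Bold", "Quarterly"),
    (140, 700, "Helvetica-Bold", "Report"),
    (72, 680, "Helvetica", "Revenue"),
    (200, 680, "Helvetica", "1200"),
    (72, 666, "Helvetica", "Costs"),
    (200, 666, "Helvetica", "800"),
]


def rebuild_lines(frags, y_tolerance=2):
    """Guess line structure by bucketing runs with nearly equal y."""
    rows = defaultdict(list)
    for x, y, _font, text in frags:
        # Runs whose y values fall into the same bucket are treated as one line.
        rows[round(y / y_tolerance)].append((x, text))
    lines = []
    # PDF y coordinates grow upward, so a larger y is higher on the page.
    for key in sorted(rows, reverse=True):
        lines.append(" ".join(text for _x, text in sorted(rows[key])))
    return lines


for line in rebuild_lines(fragments):
    print(line)
```

The output is only a guess at the original lines. Nothing in the data says that the first line was a heading or that the other two were table rows, which is the information a Word converter would need.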

2 comments

mjcohenw 4950 days ago

Every time I have used Acrobat to convert PDF to Word, the only usable parts have been the tables. The rest is generally garbage.

Fortunately, the tables were the only parts I wanted! I needed to get them from the PDF into text (csv) form. So, from Word, I copied the tables, pasted them into Excel, and saved that as csv. Easy as 1-2-3-4-5!
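
The Excel step could probably be scripted. This is a minimal sketch, assuming the copied table arrives as tab-separated text with one row per line (which is how a copied table typically pastes as plain text); `tsv_to_csv` is a made-up helper name, and only Python's standard `csv` module is used.

```python
import csv
import io


def tsv_to_csv(clipboard_text):
    """Convert tab-separated rows (one table row per line) to CSV text."""
    reader = csv.reader(io.StringIO(clipboard_text), delimiter="\t")
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        if row:  # skip blank lines between pasted tables
            writer.writerow(row)
    return out.getvalue()


# Hypothetical pasted table; the comma inside a cell gets quoted in the CSV.
sample = "Item\tPrice\nWidget, large\t4.50\n"
print(tsv_to_csv(sample))
```

Cells that contain commas are quoted by the writer, which is the part that is easy to get wrong with a plain find-and-replace of tabs.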


sliverstorm 4951 days ago

It's probably possible to do, but nobody's needed one badly enough to do it.


zrail 4951 days ago

There is actually an Apache project that can extract the text from a PDF. It does a passable job, but like I said, all of the formatting is gone.

http://pdfbox.apache.org/userguide/text_extraction.html


Toshio 4951 days ago

There is a very good pdf-to-html converter at [0], so it's a two-step process.

[0] https://github.com/coolwanglu/pdf2htmlEX
