Ask HN: Good tools for text extraction from PDF
3 points by lucasrp 4432 days ago
Hi guys,

I need a tool that converts PDF files to HTML. Since I work with public documents, the layout of the PDFs can sometimes be pretty nasty (I've attached some links at the end of this post).

We have an in-house solution that we forked several years ago from Apache PDFBox. After a while we realized that forking an open source project isn't the best answer, but we kept going because it worked.

Does anyone have suggestions? We are willing to contribute to the open source project we choose :)

Many thanks!

https://www.evernote.com/shard/s226/sh/17b87c1f-8f18-4b23-96ac-a9fbc2ac8502/ea5618043f3a9c818071bd93df9f74c3

2 comments

I've had good luck with the tools that come with xpdf:

http://www.foolabs.com/xpdf/about.html

But some of that is because the source I was pulling text from didn't change the document format much from month to month.

I guess it is the library underneath jeffmould's link.
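
For anyone who wants to try the xpdf route from a script, here is a minimal sketch of calling its pdftotext command-line tool from Python. It assumes pdftotext is installed and on your PATH; the helper names (build_pdftotext_command, extract_text) and the file name in the usage line are made up for illustration and are not part of xpdf.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, first_page=None, last_page=None, keep_layout=True):
    """Build the argument list for pdftotext, sending the text to stdout.

    Hypothetical helper: only the flags (-layout, -f, -l) come from pdftotext.
    """
    cmd = ["pdftotext"]
    if keep_layout:
        cmd.append("-layout")  # ask pdftotext to keep the physical page layout
    if first_page is not None:
        cmd += ["-f", str(first_page)]  # first page to convert
    if last_page is not None:
        cmd += ["-l", str(last_page)]  # last page to convert
    cmd += [pdf_path, "-"]  # "-" as the text file means "write to stdout"
    return cmd


def extract_text(pdf_path, **options):
    """Run pdftotext and return the extracted text as a string."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    result = subprocess.run(
        build_pdftotext_command(pdf_path, **options),
        check=True,           # raise CalledProcessError if pdftotext fails
        capture_output=True,  # collect stdout instead of printing it
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Something like `extract_text("report.pdf", first_page=1, last_page=2)` would then return the text of the first two pages. Whether `-layout` helps or hurts probably depends on how messy the source PDFs are, so it is worth trying both.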

I have used the following with some success:

http://pdftohtml.sourceforge.net/

Not sure how well maintained it still is, but it did a good job of converting basic PDF files to HTML.
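
A rough sketch of driving pdftohtml from Python, in case it helps. It assumes the pdftohtml binary is on your PATH and that it appends ".html" to the output name it is given; the function names and paths are hypothetical, and only the -noframes and -i flags come from the tool itself.

```python
import subprocess
from pathlib import Path


def build_pdftohtml_command(pdf_path, out_dir, skip_images=True):
    """Build the argument list for pdftohtml (hypothetical helper)."""
    pdf = Path(pdf_path)
    target = Path(out_dir) / pdf.stem  # output name without extension
    cmd = ["pdftohtml", "-noframes"]   # one HTML file instead of a frameset
    if skip_images:
        cmd.append("-i")               # ignore images, keep only the text
    cmd += [str(pdf), str(target)]
    return cmd


def convert_to_html(pdf_path, out_dir, **options):
    """Run pdftohtml and return the path where the HTML is expected to be."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    cmd = build_pdftohtml_command(pdf_path, out_dir, **options)
    subprocess.run(cmd, check=True)    # raises CalledProcessError on failure
    # Assumption: pdftohtml writes "<output name>.html"
    return Path(cmd[-1] + ".html")
```

For documents with nasty layouts it may be worth comparing the plain output with the tool's complex-layout mode, since the two can differ a lot on multi-column pages.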

There is also a Google Code product for going from HTML to PDF which works pretty well.