by tuddman 2216 days ago
I also built a system that extracted structured and unstructured text from images/PDFs. For generated PDFs, I found pdftotext could pull the text with 100% fidelity, so that was 'option #1'. For scanned images saved as PDFs, tesseract could sometimes extract with 90+% accuracy, but never 100%. Combining pdftotext (with the right flags set) with some of the other associated PDF tools, we were able to achieve what we were after: building a searchable DB and auto-informing corpus of information derived entirely from various PDF sources. All in-house. No sending off to 3rd parties.
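For anyone wanting to try the "pdftotext first, tesseract as fallback" approach, here is a rough Python sketch. It is not the system described above, only one plausible way to wire the tools together. It assumes the pdftotext, pdftoppm and tesseract binaries are on PATH. The function names, the 300 dpi setting and the 20-characters-per-page threshold are made up for illustration, and which pdftotext flags were "the right flags" in the original setup is not stated; -layout is just a guess.

```python
import subprocess
import tempfile
from pathlib import Path

# Hypothetical threshold: fewer visible characters per page than this
# is treated as "no usable text layer". Tune it for your own corpus.
MIN_CHARS_PER_PAGE = 20


def looks_scanned(text, pages=1, min_chars_per_page=MIN_CHARS_PER_PAGE):
    """Guess that a PDF is a scan when pdftotext returns almost no text."""
    visible = sum(1 for ch in text if not ch.isspace())
    return visible < min_chars_per_page * max(pages, 1)


def pdftotext_cmd(pdf):
    """Build the pdftotext call: -layout keeps the physical layout,
    and "-" as the output file sends the text to stdout."""
    return ["pdftotext", "-layout", str(pdf), "-"]


def ocr_pdf(pdf, dpi=300):
    """Rasterize each page with pdftoppm, then OCR the images with tesseract."""
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        subprocess.run(
            ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(prefix)],
            check=True,
        )
        chunks = []
        # pdftoppm zero-pads page numbers, so a plain sort keeps page order.
        for image in sorted(Path(tmp).glob("page-*.png")):
            result = subprocess.run(
                ["tesseract", str(image), "stdout"],
                check=True, capture_output=True, text=True,
            )
            chunks.append(result.stdout)
        return "\n".join(chunks)


def extract_text(pdf):
    """Option #1 is pdftotext; fall back to OCR only when it finds nothing."""
    result = subprocess.run(
        pdftotext_cmd(pdf), check=True, capture_output=True, text=True
    )
    # pdftotext ends each page with a form feed, which gives a page count.
    pages = max(result.stdout.count("\f"), 1)
    if looks_scanned(result.stdout, pages):
        return ocr_pdf(pdf)
    return result.stdout
```

Calling extract_text(Path("some.pdf")) returns the text either way. The character-count check is crude: a PDF that mixes generated and scanned pages would need a per-page decision instead.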
1 comment

> For the generated pdfs, I found pdftotext could pull with 100% fidelity, and so that was 'option #1'. for scanned-images-saved-as-pdfs, then tesseract could sometimes extract with 90+% accuracy.

Arrived at a similar conclusion, although I never bothered with a DB or a locally running web interface. Simply grepping the text files works flawlessly for me.
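If plain grep is not at hand, the same idea is a few lines of Python. This is only a sketch: it assumes the extracted text lives in .txt files under one directory, and grep_texts is a made-up name, not part of any tool mentioned here.

```python
import re
from pathlib import Path


def grep_texts(root, pattern, ignore_case=True):
    """Return (path, line number, line) for every matching line in *.txt files."""
    flags = re.IGNORECASE if ignore_case else 0
    regex = re.compile(pattern, flags)
    hits = []
    for path in sorted(Path(root).rglob("*.txt")):
        # errors="replace" keeps the search going past odd bytes from OCR output.
        with path.open(encoding="utf-8", errors="replace") as fh:
            for lineno, line in enumerate(fh, start=1):
                if regex.search(line):
                    hits.append((path, lineno, line.rstrip("\n")))
    return hits
```

On the command line, grep -rin "pattern" some_dir/ does roughly the same job with no code at all, which is the whole appeal of keeping the corpus as text files.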