by tuddman
2216 days ago
I also built a system that extracted structured and unstructured text from images and PDFs. For generated PDFs, I found pdftotext could pull the text with 100% fidelity, so that was 'option #1'. For scanned images saved as PDFs, tesseract could sometimes extract with 90+% accuracy, but never 100%. By combining pdftotext (with the right flags set) with some of the other associated PDF tools, we achieved what we were after: building a searchable DB and auto-informing corpus of information derived entirely from various PDF sources. All in-house. No sending off to 3rd parties.
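For anyone wanting to try the same pdftotext-first, tesseract-as-fallback flow, here is a minimal Python sketch. It is not the system described above, just one way it could look. Assumptions: the poppler tools (pdftotext, pdftoppm) and tesseract are on PATH; the function names, the `-layout` flag choice, the 300 dpi setting and the `MIN_CHARS` cutoff are all hypothetical picks for illustration.

```python
import subprocess
import tempfile
from pathlib import Path

MIN_CHARS = 20  # hypothetical cutoff: less text than this is treated as "scanned"


def has_text_layer(text, min_chars=MIN_CHARS):
    """Return True if the text holds at least min_chars non-whitespace characters."""
    return sum(1 for ch in text if not ch.isspace()) >= min_chars


def pdftotext_cmd(pdf):
    """Build the pdftotext call: -layout keeps the page layout, '-' writes to stdout."""
    return ["pdftotext", "-layout", str(pdf), "-"]


def ocr_pdf(pdf):
    """Rasterize each page with pdftoppm, then OCR every page image with tesseract."""
    pages = []
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        subprocess.run(
            ["pdftoppm", "-r", "300", "-png", str(pdf), str(prefix)],
            check=True,
        )
        # pdftoppm zero-pads page numbers, so a plain sort keeps page order.
        for image in sorted(Path(tmp).glob("page-*.png")):
            result = subprocess.run(
                ["tesseract", str(image), "stdout"],
                capture_output=True, text=True, check=True,
            )
            pages.append(result.stdout)
    return "\n".join(pages)


def extract_text(pdf):
    """Option #1 is pdftotext; fall back to OCR only when no text layer is found."""
    result = subprocess.run(
        pdftotext_cmd(pdf), capture_output=True, text=True, check=True
    )
    if has_text_layer(result.stdout):
        return result.stdout
    return ocr_pdf(pdf)
```

Calling `extract_text("some.pdf")` and writing the result to a `.txt` file next to the PDF would be enough to feed a DB loader or a plain grep later. The OCR branch will still carry tesseract's error rate, so it may be worth tagging which files went through it.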
|
I arrived at a similar conclusion, although I never bothered with a DB or a locally running web interface. Simply grepping the text files works flawlessly for me.
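Plain `grep -ri` over the directory is all this needs. For a scriptable version, a rough Python equivalent might look like the sketch below. It assumes the extracted text sits in `.txt` files under one directory; `grep_texts` is a made-up helper name, not part of any tool mentioned here.

```python
import re
from pathlib import Path


def grep_texts(root, pattern, ignore_case=True):
    """Search every .txt file under root; return (file name, line number, line) hits."""
    flags = re.IGNORECASE if ignore_case else 0
    regex = re.compile(pattern, flags)
    hits = []
    for path in sorted(Path(root).rglob("*.txt")):
        # errors="replace" keeps odd bytes from OCR output from aborting the search.
        with path.open(encoding="utf-8", errors="replace") as handle:
            for lineno, line in enumerate(handle, start=1):
                if regex.search(line):
                    hits.append((path.name, lineno, line.rstrip("\n")))
    return hits
```

Something like `grep_texts("extracted/", r"invoice")` would then list every matching line with its file and line number, where `extracted/` is a placeholder directory.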