Ask HN: What OCR tool do you use in your project?
8 points
by vikasr111
1066 days ago
I am working on a project where I want to extract data from PDF documents. Sometimes these are scanned PDFs or forms. I am looking for an OCR tool (paid or open source) that can effectively extract data from poorly scanned documents and forms. What do you use?
There are free / open source tools (like Tesseract), but if you want to use them, some manual or (semi-)automatic preprocessing steps (thresholding / binarization, deskewing, noise removal[1]) are very important to get results nearly comparable to commercial tools.
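To make two of those preprocessing steps concrete, here is a minimal pure-Python sketch of binarization (Otsu's method) and noise removal (a 3x3 median filter). It assumes the page is already loaded as a list of rows of 0-255 gray values. The function names are made up for illustration and are not part of Tesseract or any other tool mentioned here. In practice you would more likely use an image library for this, and deskewing is left out.

```python
from statistics import median


def otsu_threshold(pixels):
    """Return an Otsu threshold for an iterable of 0-255 gray values."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels with value <= t
    sum_bg = 0  # sum of gray values of those pixels
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue
        w_fg = total - w_bg
        if w_fg == 0:
            break
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        # Between-class variance (up to a constant factor).
        between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def binarize(img):
    """Map every pixel to 0 (dark) or 255 (light) using the Otsu threshold."""
    t = otsu_threshold(p for row in img for p in row)
    return [[0 if p <= t else 255 for p in row] for row in img]


def median_filter(img):
    """3x3 median filter; coordinates outside the image are clamped to the edge."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [
                img[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
            ]
            row.append(median(window))
        out.append(row)
    return out


if __name__ == "__main__":
    # Hypothetical 4x4 "scan": dark text pixels (10) on a light background (200).
    page = [
        [200, 200, 200, 200],
        [200, 10, 10, 200],
        [200, 10, 10, 200],
        [200, 200, 200, 200],
    ]
    for row in binarize(median_filter(page)):
        print(row)
```

The median filter runs first so that isolated specks do not distort the histogram that Otsu's method works on. Whether that order is best depends on the scans.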
Some Tesseract-based solutions are better integrated with automatic preprocessing; you could take a look at Papermerge or other self-hosted document management solutions[2].
There are also commercial SDKs built around Tesseract at a good price point, like Vintasoft OCR[5], which supports automatic preprocessing and delivers decent quality.
If you don't mind a (free) clicking adventure with small amounts of documents, you could also try the free version of PDF-XChange Viewer[3], which has a small but pretty good option that writes the OCR result to an embedded PDF layer, making PDFs "searchable". But the embedded OCR data cannot be easily extracted.
The best "no cloud" / offline solution I found was Abbyy FineReader[4], which also has a command line tool. But if you really want a ready-to-use, easy, good-quality solution, I would go with Google Lens (if you don't mind Google).
[1] https://towardsdatascience.com/pre-processing-in-ocr-fc231c6...
[2] https://github.com/awesome-selfhosted/awesome-selfhosted#doc...
[3] https://www.tracker-software.com/product/pdf-xchange-editor
[4] https://www.pdf-xchange.de/pdf-xchange-viewer/
[5] https://www.vintasoft.com/vsocr-dotnet-index.html