Ask HN: What OCR tool do you use in your project?

	8 points by vikasr111 1066 days ago
	I am working on a project where I want to extract data from PDF documents. Sometimes these are scanned PDFs or forms. I am looking for an OCR tool (paid or open source) that can effectively extract data from poorly scanned documents and forms. What do you use?

5 comments

sandreas 1066 days ago

It depends on what input amount, format and quality you have.

There are free / open source tools (like Tesseract), but if you would like to use them, some manual or (semi-)automatic preprocessing steps (thresholding / binarization, deskewing, noise removal[1]) are very important to get results nearly comparable to commercial tools.
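To make the binarization step concrete, here is a minimal sketch of Otsu thresholding in plain Python. It assumes the page is already loaded as rows of 8-bit grayscale values; the function names `otsu_threshold` and `binarize` are made up for illustration, and in practice an image library would likely do this for you. Deskewing and noise removal are not covered.

```python
def otsu_threshold(pixels):
    """Return the Otsu threshold for a flat list of 8-bit grayscale values."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels with value <= t
    sum_bg = 0  # sum of those pixel values
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue          # no dark pixels yet
        w_fg = total - w_bg
        if w_fg == 0:
            break             # every pixel is already in the dark class
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        # Between-class variance (up to a constant factor); Otsu maximizes it.
        between_var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarize(image):
    """Map each pixel to 0 (dark) or 255 (light) using the Otsu threshold."""
    flat = [p for row in image for p in row]
    t = otsu_threshold(flat)
    return [[0 if p <= t else 255 for p in row] for row in image]
```

Calling `binarize(rows)` on a grayscale page returns a pure black-and-white version of it, which is the kind of input OCR engines tend to handle better than a washed-out scan.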

Some Tesseract-based solutions are better integrated with automatic preprocessing. You could take a look at Papermerge or other self-hosted document management solutions[2].

There are also commercial SDKs built around Tesseract at a good price point, like Vintasoft OCR[5], which supports automatic preprocessing and delivers decent quality.

If you don't mind having a (free) clicking adventure with small amounts of documents, you could also try the free version of PDF X-Change viewer[3], which has a small but pretty good option to embed the OCR result as a PDF layer, making PDFs "searchable". But the embedded OCR data cannot be easily extracted.

The best "no cloud" / offline solution I found was Abbyy FineReader[4], which also has a command line tool. But if you really want a ready-to-use, easy and good quality solution, I would go with Google Lens (if you don't mind Google).

[1] https://towardsdatascience.com/pre-processing-in-ocr-fc231c6...

[2] https://github.com/awesome-selfhosted/awesome-selfhosted#doc...

[3] https://www.tracker-software.com/product/pdf-xchange-editor

[4] https://www.pdf-xchange.de/pdf-xchange-viewer/

[5] https://www.vintasoft.com/vsocr-dotnet-index.html

beardyw 1066 days ago

A bit off topic but I've just started using Google Lens to extract whole pages from books with my phone. Near perfect conversion to text is great for taking notes.

vikasr111 1066 days ago

Google Lens works great for individual use cases. I wonder what they are using behind the scenes.

In my case I need to extract data on the server side, so a library/API would be most suitable.

smoldesu 1066 days ago

I still use Tesseract. It's not the fastest or most accurate anymore, but it gets what I need out of PDF files.

vikasr111 1066 days ago

Does it work well with scanned PDFs? In my experiments it was not giving the correct output.

james-revisoai 1066 days ago

Explore different page segmentation modes and make sure you are using Tesseract v4 (it's a massive step up).
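For anyone wanting to try this, a rough sketch of calling the Tesseract CLI with an explicit page segmentation mode. `--psm`, `--oem` and `-l` are real Tesseract options (`--oem 1` selects the LSTM engine introduced in v4); the helper names `build_tesseract_command` and `ocr_with_psm` are hypothetical. It assumes the scanned PDF pages have already been converted to images, since Tesseract reads image files rather than PDFs.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, psm=6, oem=1, lang="eng"):
    """Build the argument list for one Tesseract run that prints text to stdout."""
    if not 0 <= psm <= 13:
        raise ValueError("psm must be between 0 and 13")
    return [
        "tesseract", image_path, "stdout",  # "stdout" as output base prints the text
        "--psm", str(psm),                  # page segmentation mode
        "--oem", str(oem),                  # 1 = LSTM engine only
        "-l", lang,
    ]


def ocr_with_psm(image_path, psm=6):
    """Run Tesseract on one image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract binary not found on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, psm),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Looping `ocr_with_psm("page.png", psm)` over a few modes (for example 3, 4, 6 and 11) and comparing the output by eye is a cheap way to see which one suits a given form layout; `page.png` is a placeholder file name.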

is_true 1065 days ago

We started using Tesseract for a project that needed to extract text from video frames. But in the end we moved to EasyOCR, as it needed less preprocessing for our use case.

itake 1065 days ago

What languages do you need to support? Off-the-shelf models don't work well on non-Latin languages. You may need to train your own.
