A guide to OCR with Tesseract, OpenCV and Python

	A guide to OCR with Tesseract, OpenCV and Python (nanonets.com)
	130 points by ole_gooner 2372 days ago

9 comments

sandreas 2372 days ago

The preprocessing step uses Otsu, which is often inaccurate because it applies a single threshold value to the whole image. An adaptive thresholding algorithm (like Sauvola or Wolf binarization) could improve the preprocessing a lot on many images that are not purely black and white. See https://github.com/chriswolfvision/local_adaptive_binarizati... for details.

Other nice resources:
- https://www.researchgate.net/publication/306352164_Watershed...
- https://isi.edu/integration/papers/chiang11-icdar.pdf
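
To make the difference between a global and a local threshold concrete, here is a minimal pure-Python sketch. It is not the code from the article or from the linked repository: the one-row toy image, the 3-pixel window, and the values k=0.2 and R=128 are all assumptions chosen for illustration. It uses the Sauvola rule T = m * (1 + k * (s / R - 1)), where m and s are the local mean and standard deviation.

```python
from math import sqrt

def global_threshold(image, t):
    """Binarize with one threshold t for the whole image (the kind of cut Otsu picks)."""
    return [[1 if px > t else 0 for px in row] for row in image]

def sauvola_threshold(image, window=3, k=0.2, r=128.0):
    """Binarize with a per-pixel Sauvola threshold computed from a local window."""
    h, w = len(image), len(image[0])
    half = window // 2
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            # Neighbourhood around (x, y), clipped at the image border.
            vals = [image[j][i]
                    for j in range(max(0, y - half), min(h, y + half + 1))
                    for i in range(max(0, x - half), min(w, x + half + 1))]
            m = sum(vals) / len(vals)                              # local mean
            s = sqrt(sum((v - m) ** 2 for v in vals) / len(vals))  # local std dev
            t = m * (1 + k * (s / r - 1))                          # Sauvola threshold
            row.append(1 if image[y][x] > t else 0)
        out.append(row)
    return out

# Toy scanline: background fades from 220 to 80 (uneven lighting),
# with dark "ink" pixels at index 1 (value 120) and index 6 (value 20).
image = [[220, 120, 180, 160, 140, 120, 20, 80]]

print(global_threshold(image, 100))  # [[1, 1, 1, 1, 1, 1, 0, 0]]
print(sauvola_threshold(image))      # [[1, 0, 1, 1, 1, 1, 0, 1]]
```

No single global value can work on this row, because the ink pixel on the bright side (120) is lighter than the background on the dark side (80). The local threshold follows the lighting and separates both ink pixels.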

onemorelizard 2372 days ago

The article is pretty great. Most tutorials I've found online about OCR basically run you through the installation process and a few basic CLI commands, or an introduction to the C++ API. This one takes you through some interesting details like the bounding box info, the template matching bits, and playing around with the config. The training process for Tesseract, though not covered in the article, seems like quite a task.

m3nu 2372 days ago

Automatically finding specific boxes/fields is quite interesting. I maintain a Python package[1] that processes invoices using a template/regex-based approach. It works alright, but eventually runs into some limitations. The box-model from the article could push it further.

1: https://github.com/invoice-x/invoice2data
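
For readers unfamiliar with the template/regex idea, a rough sketch follows. This is not invoice2data's actual API or template format: the `TEMPLATE` dictionary, the `extract_fields` helper, and the sample text are hypothetical, and the text is assumed to have already come out of an OCR step.

```python
import re

# Hypothetical template: one regex per field, each with exactly one capture group.
TEMPLATE = {
    "invoice_number": r"Invoice\s+No\.?\s*:?\s*(\w+)",
    "date": r"Date\s*:\s*(\d{4}-\d{2}-\d{2})",
    "amount": r"Total\s*:\s*\$?([\d.,]+)",
}

def extract_fields(text, template):
    """Return {field: captured string, or None if the pattern is not found}."""
    result = {}
    for field, pattern in template.items():
        match = re.search(pattern, text, flags=re.IGNORECASE)
        result[field] = match.group(1) if match else None
    return result

# Made-up OCR output for illustration.
sample = "ACME Corp\nInvoice No: INV1042\nDate: 2019-11-08\nTotal: $1,250.00\n"
fields = extract_fields(sample, TEMPLATE)
print(fields)
```

The limitation mentioned above shows up quickly with this approach: every new layout needs its own set of patterns, and a field whose label is misread by the OCR step simply comes back as `None`.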

mpeg 2372 days ago

Hey this is great, I made something ad-hoc to do this for a client and might borrow some ideas to improve it.

I heavily leaned on AWS Textract for the bounding boxes though, as the kind of data I had to extract didn't have very well defined fields. I used some of the techniques described in this link [0] particularly around table extraction.

I really like how you define the fields in YAML though, I defined mine in code and it ended up being a bit messy.

[0]: https://datascience.blog.wzb.eu/2017/02/16/data-mining-ocr-p...

MayeulC 2372 days ago

I've skimmed over the article, which seemed to give a rather sincere overview of the OCR market, then tesseract, the way it works, and how to interface it with python.

However, the article is also an advertisement for nanonets, so they also chose to highlight the complexity side a bit before putting themselves forward.

As someone who hadn't heard of them before, I think this could be stated in the title. They seem to lease (I prefer that term) an API to do OCR with a couple of rules and templates depending on your use case.

I am not entirely sure what they expect from this. Maybe SEO, or to hijack search results?

jftuga 2371 days ago

OCRmyPDF (based on Tesseract) works very well: https://github.com/jbarlow83/OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched.

wswope 2372 days ago

Potentially helpful notes:

The character whitelist/blacklist functionality doesn't work for the default LSTM-based engine.

Regarding preprocessing, upscaling the image size can have a dramatic impact on performance.

IIRC tessdata_fast (which the article mentions) is the default that ships with most prebuilt versions of Tesseract, so you probably don't need to mess with that. In my use case, I found that tessdata_best actually performed slightly worse in terms of accuracy.
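
If the whitelist note above applies to your Tesseract version, one workaround is to select the legacy engine explicitly. The sketch below only builds the option string; `legacy_whitelist_config` is a hypothetical helper name, and the digits-only whitelist and page segmentation mode 6 are assumptions for illustration. Note that `--oem 0` needs a traineddata file that still contains the legacy model.

```python
def legacy_whitelist_config(whitelist="0123456789", psm=6):
    """Build a Tesseract option string: legacy engine (--oem 0) plus a character whitelist."""
    return f"--oem 0 --psm {psm} -c tessedit_char_whitelist={whitelist}"

config = legacy_whitelist_config()
print(config)

# With pytesseract installed, the string would be passed like this:
#   text = pytesseract.image_to_string(image, config=config)
```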

udayrddy 2372 days ago

"Let's assume you've created an OCR model to detect Name, Address, DOB from Drivers Licenses. Since there are 3 categories in this model, each API call will be priced at $0.01 * 3 = $0.03/image. So if you're on the Medium plan, you'll get 99/0.03 = 3300 API calls."

Whoa!! That is insanely high-priced.
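
The arithmetic in the quote can be checked in a few lines. The figures come from the quoted text only; the $99 plan price is inferred from the "99/0.03" in the quote, and the variable names are made up. Amounts are kept in integer cents to avoid floating-point rounding.

```python
price_per_field_cents = 1   # $0.01 per category, per the quote
fields_per_image = 3        # Name, Address, DOB
plan_budget_cents = 9900    # $99, inferred from "99/0.03" in the quote

cost_per_image_cents = price_per_field_cents * fields_per_image
calls_in_plan = plan_budget_cents // cost_per_image_cents

print(cost_per_image_cents / 100)  # dollars per image
print(calls_in_plan)               # API calls covered by the plan
```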

ngcc_hk 2371 days ago

Compatible with TensorFlow.js?

Aaargh20318 2372 days ago

The title is a bit misleading (and doesn't match the linked article). This isn't about building an OCR engine, it's about using an existing one.

dang 2372 days ago

Yes. We've changed the title to that of the article. From the site guidelines: "Please use the original title, unless it is misleading or linkbait; don't editorialize."

Submitted title was "Building an OCR Engine with Python and Tesseract", which broke that guideline, assuming the page title didn't change.

tasogare 2372 days ago

Yes and no. Everyone who knows what Tesseract is will understand that no deep technical details will be discussed. For those who don't, however, it is indeed a bit baity.
