A guide to OCR with Tesseract, OpenCV and Python (nanonets.com)
130 points by ole_gooner 2372 days ago
9 comments

The preprocessing step uses Otsu thresholding, which is pretty inaccurate because it applies a single threshold value to the whole image. An adaptive thresholding algorithm (like Sauvola or Wolf binarization) could improve the preprocessing A LOT on the many images that are not purely black and white. See https://github.com/chriswolfvision/local_adaptive_binarizati... for details.

Other nice resources:
- https://www.researchgate.net/publication/306352164_Watershed...
- https://isi.edu/integration/papers/chiang11-icdar.pdf
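To make the global-versus-local point concrete, here is a rough pure-Python sketch. It is not the article's code and not the implementation from the linked repository. It compares a global Otsu threshold with a Sauvola-style local threshold, T = mean * (1 + k * (std / R - 1)), on a tiny synthetic image. The window size, k = 0.2, R = 128 and the test image itself are assumptions picked for illustration.

```python
def otsu_threshold(values):
    """Global threshold (0..255) that maximizes between-class variance."""
    hist = [0] * 256
    for v in values:
        hist[v] += 1
    total = len(values)
    sum_all = sum(i * n for i, n in enumerate(hist))
    best_t, best_score = 0, -1.0
    w0 = sum0 = 0  # count and intensity sum of pixels <= t
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        m0, m1 = sum0 / w0, (sum_all - sum0) / w1
        score = w0 * w1 * (m0 - m1) ** 2
        if score > best_score:
            best_t, best_score = t, score
    return best_t


def sauvola_mask(img, window=9, k=0.2, r=128.0):
    """Per-pixel threshold T = mean * (1 + k * (std / r - 1)); 1 marks ink."""
    h, w = len(img), len(img[0])
    half = window // 2
    mask = []
    for y in range(h):
        row = []
        for x in range(w):
            # Collect the neighbourhood, clipped at the image borders.
            vals = [img[yy][xx]
                    for yy in range(max(0, y - half), min(h, y + half + 1))
                    for xx in range(max(0, x - half), min(w, x + half + 1))]
            mean = sum(vals) / len(vals)
            std = (sum((v - mean) ** 2 for v in vals) / len(vals)) ** 0.5
            t = mean * (1 + k * (std / r - 1))
            row.append(1 if img[y][x] <= t else 0)
        mask.append(row)
    return mask


def count_errors(mask, truth):
    """Number of pixels where the binarization disagrees with the ground truth."""
    return sum(m != g for mr, gr in zip(mask, truth) for m, g in zip(mr, gr))


# Synthetic page (hypothetical): background fades from 200 to 83 left to right,
# with dark vertical strokes 60 levels below the local background.
WIDTH, HEIGHT, INK_COLUMNS = 40, 5, {5, 15, 25, 35}
truth = [[1 if x in INK_COLUMNS else 0 for x in range(WIDTH)] for _ in range(HEIGHT)]
img = [[200 - 3 * x - 60 * truth[y][x] for x in range(WIDTH)] for y in range(HEIGHT)]

t_global = otsu_threshold([v for row in img for v in row])
otsu_mask = [[1 if v <= t_global else 0 for v in row] for row in img]

otsu_errors = count_errors(otsu_mask, truth)
sauvola_errors = count_errors(sauvola_mask(img), truth)
print("Otsu errors:", otsu_errors, "Sauvola errors:", sauvola_errors)
```

In this toy image the stroke on the bright side (value 125) is lighter than the background on the dark side (as low as 83), so no single global threshold can separate ink from paper, while a threshold computed per neighbourhood can. Real scans are noisier, and a production version would use integral images or a library rather than this quadratic loop.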

The article is pretty great. Most tutorials I've found online for working with OCR basically run you through the installation process and a few basic CLI commands, or an introduction to the C++ API. This one takes you through some interesting details like the bounding box info, the template matching bits and playing around with the config. The training process for Tesseract, though not covered in the article, seems like quite a task.
Automatically finding specific boxes/fields is quite interesting. I maintain a Python package[1] that processes invoices using a template/regex-based approach. It works alright, but eventually runs into some limitations. The box-model from the article could push it further.

1: https://github.com/invoice-x/invoice2data
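For readers unfamiliar with the template/regex idea, a minimal sketch might look like the following. This is not invoice2data's actual API or template format. The `TEMPLATE` dict, the field names, the patterns and the sample text are all hypothetical.

```python
import re

# Hypothetical template: field name -> regex with exactly one capture group.
TEMPLATE = {
    "invoice_number": r"Invoice\s+No\.?\s*:?\s*(\w+)",
    "date": r"Date\s*:\s*(\d{4}-\d{2}-\d{2})",
    "total": r"Total\s*:\s*\$?\s*(\d+(?:\.\d{2})?)",
}


def extract_fields(text, template):
    """Apply each field's regex to the OCR text; missing fields map to None."""
    result = {}
    for name, pattern in template.items():
        match = re.search(pattern, text, flags=re.IGNORECASE)
        result[name] = match.group(1) if match else None
    return result


# Hypothetical OCR output for one invoice.
SAMPLE = """ACME Corp
Invoice No: A1234
Date: 2019-03-07
Total: $ 99.50
"""

extracted = extract_fields(SAMPLE, TEMPLATE)
print(extracted)
```

The limitation mentioned above shows up quickly: any layout change or OCR misread (say, "Tota1" for "Total") makes a pattern miss, and the field silently comes back as None. That is where positional information such as bounding boxes could help.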

Hey this is great, I made something ad-hoc to do this for a client and might borrow some ideas to improve it.

I heavily leaned on AWS Textract for the bounding boxes though, as the kind of data I had to extract didn't have very well defined fields. I used some of the techniques described in this link [0] particularly around table extraction.

I really like how you define the fields in YAML though, I defined mine in code and it ended up being a bit messy.

[0]: https://datascience.blog.wzb.eu/2017/02/16/data-mining-ocr-p...

I've skimmed over the article, which seemed to give a rather sincere overview of the OCR market, then Tesseract, the way it works, and how to interface with it from Python.

However, the article is also an advertisement for nanonets, so they also chose to highlight the complexity side a bit before putting themselves forward.

As someone who hadn't heard of them before, I think this could have been stated in the title. They seem to lease (I prefer that term) an API that does OCR with a couple of rules and templates depending on your use case.

I am not entirely sure what they expect from this. Maybe SEO, or to hijack search results?

OCRmyPDF (based on Tesseract) works very well: https://github.com/jbarlow83/OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched.

Potentially helpful notes:

The character whitelist/blacklist functionality doesn't work for the default LSTM-based engine.

Regarding preprocessing, upscaling the image size can have a dramatic impact on performance.

IIRC tessdata_fast (which the article mentions) is the default that ships with most prebuilt versions of Tesseract, so you probably don't need to mess with that. In my use case, I found that tessdata_best actually performed slightly worse in terms of accuracy.
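On the whitelist note above, two workarounds are sometimes used: request the legacy engine explicitly, or filter the OCR output afterwards. The sketch below shows both in hedged form. The helper names are made up for illustration. `--oem 0` only works if legacy-capable traineddata is installed, which as far as I know tessdata_fast does not include.

```python
def whitelist_config(chars, oem=0, psm=7):
    """Build a Tesseract config string.

    --oem 0 requests the legacy engine, --psm 7 treats the image as a single
    text line, and tessedit_char_whitelist restricts the allowed characters.
    """
    return f"--oem {oem} --psm {psm} -c tessedit_char_whitelist={chars}"


def filter_to_whitelist(text, chars, keep_whitespace=True):
    """Engine-independent fallback: drop every character not in the whitelist."""
    allowed = set(chars)
    return "".join(
        c for c in text
        if c in allowed or (keep_whitespace and c.isspace())
    )


config = whitelist_config("0123456789")
cleaned = filter_to_whitelist("Tel: 555-0123\n", "0123456789-")
print(repr(config))
print(repr(cleaned))
```

With pytesseract, the config string would be passed as `pytesseract.image_to_string(image, config=config)`. The post-filter is cruder than a real whitelist, since it deletes characters instead of steering the recognizer toward allowed ones, so it cannot turn a misread "O" into "0".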

"Let's assume you've created an OCR model to detect Name, Address, DOB from Drivers Licenses. Since there are 3 categories in this model, each API call will be priced at $0.01 * 3 = $0.03/image. So if you're on the Medium plan, you'll get 99/0.03 = 3300 API calls."

Whoa! That is insanely expensive.
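The arithmetic in the quote can be checked with a few lines. This is only a sketch: the $0.01 per category comes from the quote, and the $99 plan budget is an assumption inferred from the "99/0.03" in it. Prices are kept in integer cents to avoid floating-point rounding.

```python
PRICE_PER_CATEGORY_CENTS = 1  # $0.01 per category, from the quote
PLAN_BUDGET_CENTS = 9900      # assumed $99 plan, inferred from "99/0.03"


def calls_for_budget(budget_cents, n_categories,
                     price_per_category_cents=PRICE_PER_CATEGORY_CENTS):
    """Number of whole API calls a budget buys for a model with n categories."""
    cost_per_call_cents = n_categories * price_per_category_cents
    return budget_cents // cost_per_call_cents


# Name, Address, DOB -> 3 categories -> 3 cents per image.
calls = calls_for_budget(PLAN_BUDGET_CENTS, n_categories=3)
print(calls)
```

So the per-image price scales linearly with the number of fields, and the number of calls a plan buys shrinks by the same factor.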

Compatible with TensorFlow.js?
The title is a bit misleading (and doesn't match the linked article). This isn't about building an OCR engine, it's about using an existing one.
Yes. We've changed the title to that of the article. From the site guidelines: "Please use the original title, unless it is misleading or linkbait; don't editorialize."

Submitted title was "Building an OCR Engine with Python and Tesseract", which broke that guideline, assuming the page title didn't change.

Yes and no. Everyone who knows what Tesseract is will understand that no deep technical details will be discussed. For those who don't, however, it is indeed a bit baity.