
by shiredude95 2171 days ago

I was building an image search engine[0] a while back and ran into the same OCR issues you mentioned. What I realized is that Tesseract[1] (one of the more popular OCR frameworks) works well only as long as you can give it data similar to what it was trained on.

We were basically trying to transcribe message screenshots, which should have been relatively straightforward given the homogeneity of the font. But that was not the case, since Tesseract was not trained on the layout of message screenshots. The accuracy of raw Tesseract on our test dataset was somewhere around 0.5-0.6 BLEU.
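For anyone who wants to score their own OCR output the same way, here is a minimal sketch of a BLEU-style metric. It is an assumption about the setup, not the exact scorer we used: word-level tokens split on whitespace, a single reference transcript, n-grams up to 4, and no smoothing (so any n-gram order with zero matches gives a score of 0). The function names are made up for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(reference: str, hypothesis: str, max_n: int = 4) -> float:
    """Unsmoothed, single-reference BLEU over whitespace tokens (illustrative)."""
    ref, hyp = reference.split(), hypothesis.split()
    if not hyp:
        return 0.0
    log_sum = 0.0
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hyp, n)
        ref_counts = ngrams(ref, n)
        total = sum(hyp_counts.values())
        # Clip each hypothesis n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        if total == 0 or clipped == 0:
            return 0.0
        log_sum += math.log(clipped / total)
    # Brevity penalty: punish hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(log_sum / max_n)


if __name__ == "__main__":
    truth = "see you at the cafe at noon"
    ocr_out = "see you at the cafe at n00n"
    print(round(bleu(truth, ocr_out), 3))
```

Averaging this over every (ground truth, OCR output) pair in a test set gives a single number comparable across preprocessing variants.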

Once we were able to isolate individual parts of the image and feed them to Tesseract, we were able to get around 0.9 BLEU on the same dataset.
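To give a rough idea of what "isolating parts of the image" can look like, here is a sketch, not our actual pipeline. It assumes a grayscale image stored as a list of pixel rows (0 = black, 255 = white), splits it into horizontal bands wherever it finds blank rows, and runs an OCR function on each band separately. The names `find_text_bands` and `transcribe_by_region`, the threshold, and the gap size are all hypothetical; the OCR step is passed in as a callable so the segmentation logic does not depend on any particular engine.

```python
from typing import Callable, List, Tuple

Image = List[List[int]]  # grayscale rows: 0 = black, 255 = white


def find_text_bands(img: Image, ink_threshold: int = 128,
                    min_gap: int = 2) -> List[Tuple[int, int]]:
    """Return (top, bottom) row ranges, bottom exclusive, that contain ink.

    A band ends once at least `min_gap` consecutive blank rows follow it.
    """
    has_ink = [any(px < ink_threshold for px in row) for row in img]
    bands: List[Tuple[int, int]] = []
    start = None
    blank_run = 0
    for y, ink in enumerate(has_ink):
        if ink:
            if start is None:
                start = y
            blank_run = 0
        elif start is not None:
            blank_run += 1
            if blank_run >= min_gap:
                # Band ends just before the run of blank rows began.
                bands.append((start, y - blank_run + 1))
                start = None
                blank_run = 0
    if start is not None:
        # Image ended mid-band; trim any trailing blank rows.
        bands.append((start, len(img) - blank_run))
    return bands


def transcribe_by_region(img: Image, ocr: Callable[[Image], str]) -> List[str]:
    """Run `ocr` on each detected band instead of on the whole image."""
    return [ocr(img[top:bottom]).strip() for top, bottom in find_text_bands(img)]


if __name__ == "__main__":
    W, B = 255, 0
    page = [[W, B, W], [W, B, W], [W, W, W], [W, W, W], [B, B, W], [W, W, W]]
    # Stand-in OCR that just reports the band height.
    print(transcribe_by_region(page, lambda region: f"{len(region)} rows"))
```

With Pillow and pytesseract, the `ocr` callable could crop each band from the original image and pass it to `pytesseract.image_to_string`. Real message screenshots would likely need more than row gaps (bubble detection, color thresholding), so treat this as the simplest possible version of the idea.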

TL;DR: Some nifty image processing is required to make Tesseract perform as expected.

[0] (https://www.askgoose.com) [1] (https://github.com/tesseract-ocr/tesseract)