by milesokeefe 2811 days ago
Tika doesn’t do OCR, it only extracts text content from binary files. For an image it’ll only give you metadata and such.

A better comparison would be against Tesseract or ABBYY FineReader.

EDIT: I wasn't aware that Tika now embeds Tesseract.[1] Still, it's a simple wrapper so the real comparison is against Tesseract.

[1] https://wiki.apache.org/tika/TikaOCR

2 comments

For the search use case, you can "cheat" and keep multiple candidate readings for each word you find in the image. Evernote does this: its OCR results carry 2-3 options for each word. I don't know whether Tesseract supports this mode of operation, or whether Dropbox is doing this.
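To make the "multiple answers per word" idea concrete, here is a minimal sketch of how a search index could use it. This is not how Evernote, Tesseract, or Dropbox actually do it; the candidate lists, document ids, and the `build_index` / `search` helpers are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical OCR output: one list of candidate readings per word position.
ocr_candidates = [
    ["invoice", "invoicc"],
    ["total", "tota1", "tolal"],
    ["2017", "2O17"],
]


def build_index(documents):
    """Map every candidate reading to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, positions in documents.items():
        for candidates in positions:
            for word in candidates:
                index[word.lower()].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents where every query term matches some candidate."""
    terms = query.lower().split()
    if not terms:
        return set()
    # index.get avoids inserting empty entries into the defaultdict.
    matches = [index.get(term, set()) for term in terms]
    return set.intersection(*matches)


documents = {
    "scan1": ocr_candidates,
    "scan2": [["receipt"], ["total"]],
}
index = build_index(documents)
hits = search(index, "invoice total")
```

The trade-off is recall over precision: a query matches if any candidate reading matches, so a misread word is less likely to hide a document, at the cost of a larger index and the occasional false hit.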
I think they already tried commercial off-the-shelf OCR software (which they didn't name, but I would assume it's ABBYY) before they decided to build their own solution:

https://blogs.dropbox.com/tech/2017/04/creating-a-modern-ocr...

ABBYY hasn't been all that amazing in my experience. I compared it with the Neat scanner software a few months ago, and the latter seemed to do a noticeably better job.