| HN Mirror



	by perturbation 2850 days ago
	It would be nice to benchmark the text extraction against a baseline method, say Apache Tika (https://tika.apache.org/). I would expect the deep learning approach to outperform traditional approaches in terms of accuracy, but it would be good to see accuracy vs. CPU / memory used, etc.
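
A rough sketch of what such a benchmark harness could look like, assuming accuracy is measured as character error rate (edit distance divided by reference length). The `extract` callable and the `samples` list are hypothetical placeholders for whichever engine and labeled data are being compared. Note that `tracemalloc` only tracks Python-level allocations, so memory used by a native or out-of-process engine would need a different measurement.

```python
import time
import tracemalloc


def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def benchmark(extract, samples):
    """Run `extract` over (input, expected_text) pairs.

    Returns character error rate, time spent inside `extract`,
    and peak Python-level memory. `extract` is a hypothetical
    stand-in for a real OCR or text-extraction call.
    """
    errors = chars = 0
    elapsed = 0.0
    tracemalloc.start()
    for data, expected in samples:
        t0 = time.perf_counter()
        output = extract(data)
        elapsed += time.perf_counter() - t0
        errors += edit_distance(output, expected)
        chars += len(expected)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {"cer": errors / max(chars, 1), "seconds": elapsed, "peak_bytes": peak}
```

Running `benchmark` once per engine over the same `samples` would give comparable rows of accuracy against cost.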


milesokeefe 2850 days ago

Tika doesn’t do OCR, it only extracts text content from binary files. For an image it’ll only give you metadata and such.

A better comparison would be against Tesseract or ABBYY FineReader.

EDIT: I wasn't aware that Tika now embeds Tesseract.[1] Still, it's a simple wrapper so the real comparison is against Tesseract.

[1] https://wiki.apache.org/tika/TikaOCR


dunham 2850 days ago

For the search use case, you can "cheat" and provide multiple answers for each word that you find in the image. Evernote does this. (It has 2-3 options for each word in its OCR results.) I don't know whether Tesseract supports this mode of operation, or whether Dropbox is doing this.
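
A minimal sketch of the idea, assuming the OCR step yields a list of candidate readings for each word position. The data layout, file names, and function names here are illustrative only and are not taken from Evernote, Tesseract, or Dropbox.

```python
from collections import defaultdict


def build_index(docs):
    """Build an inverted index from candidate OCR readings.

    `docs` maps a document id to a list of word slots; each slot is a
    list of alternative readings for one word in the image. Every
    alternative is indexed, so a query matches if any reading matches.
    """
    index = defaultdict(set)
    for doc_id, slots in docs.items():
        for candidates in slots:
            for word in candidates:
                index[word.lower()].add(doc_id)
    return index


def search(index, query):
    """Return the sorted ids of documents with any reading equal to `query`."""
    return sorted(index.get(query.lower(), set()))


# Hypothetical OCR output: 2 candidate readings per word.
docs = {
    "receipt.png": [["Total", "Tota1"], ["due", "duc"]],
    "note.png": [["Tota1", "Total"], ["recall", "recal1"]],
}
index = build_index(docs)
```

The trade-off is a larger index and some false positives in exchange for better recall when the top reading is wrong.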


zawerf 2850 days ago

I think they already tried commercial off-the-shelf OCR software (which they didn't name, but I would assume it's ABBYY) before they decided to build their own solution:

https://blogs.dropbox.com/tech/2017/04/creating-a-modern-ocr...


mehrdadn 2849 days ago

ABBYY hasn't been all that amazing in my experience. I compared it with Neat Scanner software a few months ago and the latter seemed to do a noticeably better job.


thebouv 2850 days ago

My first thought when reading this was that it seemed almost over-engineered compared to just using Tika + Tesseract.

I'm not sure what benefit they are getting from using machine learning for this other than "decide whether to try and process this file or not".

Tika + Tesseract seems to be able to do the heavy lifting they spent a lot of time talking about in that article.


rodaliste 2849 days ago

I worked on a very similar system for a very different company, and I tend to think that a good reason to implement your own OCR models (if you can afford it) would be optimizing CPU cost. Tesseract can be quite expensive to run at scale, maxing out the CPU at 100% for a simple page and taking about 5-30 seconds for full-page extraction. Also, most Tesseract pipelines take entire PDF files for processing, whilst you could achieve better latency by processing pages in parallel and merging the results, as they suggest in the post.
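
A minimal sketch of the per-page fan-out and merge, assuming the document has already been split into pages. `ocr_page` is a hypothetical stand-in for the real OCR call, not any library's API. A thread pool suits an engine that runs out of process or releases the GIL; for CPU-bound pure-Python work, `ProcessPoolExecutor` would be the likelier choice.

```python
from concurrent.futures import ThreadPoolExecutor


def ocr_page(page):
    """Hypothetical stand-in for a real per-page OCR call."""
    return page.upper()


def ocr_document(pages, max_workers=4):
    """OCR each page concurrently, then merge the text in page order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, so the merge
        # keeps the original page sequence even if pages finish out of order.
        texts = list(pool.map(ocr_page, pages))
    return "\n".join(texts)
```

With this shape, latency for a document is bounded by the slowest page rather than by the sum of all pages, as long as enough workers are available.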


tim_sw 2850 days ago

Tesseract does not work well out of the box and is usually outperformed by custom OCR models.
