Hacker News
by thebouv 2803 days ago
My first thought when reading this was it seemed almost over-engineered compared to just using Tika+Tesseract.

I'm not sure what benefit they are getting from using machine learning for this other than "decide whether to try and process this file or not".

Tika + Tesseract seems to be able to do the heavy lifting they spent a lot of time talking about in that article.

2 comments

I worked on a very similar system for a very different company, and I tend to think that a good reason to implement your own OCR models (if you can afford it) is to reduce CPU cost. Tesseract can be quite expensive to run at scale, pinning the CPU at 100% even for a simple page and taking about 5-30 seconds for a full-page extraction. Also, most Tesseract pipelines take entire PDF files for processing, whereas you could achieve better latency by processing pages in parallel and merging the results, as they suggest in the post.
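
To make the per-page idea concrete, here is a rough Python sketch, not what the post's authors or any particular pipeline actually do. `ocr_page` is a hypothetical stand-in for whatever OCR call you use (Tesseract or a custom model), and the page list is assumed to be already split out of the PDF.

```python
from concurrent.futures import ThreadPoolExecutor


def ocr_page(page):
    """Hypothetical placeholder: swap in a real OCR call for one page."""
    return f"[text of {page}]"


def ocr_document(pages, max_workers=4):
    """OCR each page independently, then merge results in page order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, even when pages
        # finish out of order, so the merge step is a plain join.
        texts = list(pool.map(ocr_page, pages))
    return "\n".join(texts)


merged = ocr_document(["page-1", "page-2", "page-3"])
```

Threads are enough when each page is handed to an external OCR process; if the OCR runs as CPU-bound Python code in the same process, `ProcessPoolExecutor` would be the better fit.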
Tesseract does not work well out of the box and is usually outperformed by custom models for OCR.