by exhibitapp 1152 days ago
I've worked extensively in this space. For those looking for just an OCR solution, MSFT's offering "read" is by far the most accurate. Key-value, table, and other information extraction is a much harder problem. Anything that can go wrong in production will: documents with extra pages, rotated, blacked out, or fuzzy. Many steps go into making document extraction really e2e.

The biggest enterprise users process thousands of pages a minute, which also turns document extraction into a distributed-systems scaling problem.

5 comments

A few days ago, IBM announced a new OCR system[1]. Have you by chance compared it to Microsoft's offering? I'm currently looking for the best-in-class OCR solution for scanned PDF documents.

[1]: https://www.ibm.com/cloud/blog/exploring-ibms-new-optical-ch...

Call me biased, but I've learned over time that anything that comes out of the Watson team looks good only in PR statements but sucks in production, especially at tasks like OCR. YMMV.
We currently develop solutions in this area, and I believe that isolated OCR is not the way to go. Things are moving rapidly towards end-to-end processing of documents with huge transformer models, and I also believe that multi-modal GPT models will quickly win all use cases. If you are interested in working on this topic and are located in the northern Germany region, send me a message.
I would like to extract text from approximately 2000 PDF files (machine generated, not scanned) in which the layout can differ from file to file. Some have normal paragraphs, others two or even three columns. All contain tables, but I am not interested in them. Do you know a good (semi-)automatic solution for this?
This is a hard problem and will require an enterprise solution, unfortunately. If it's only 2000 PDFs, you might be better off outsourcing to an off-shore consulting agency to do it manually.
Thanks for the reply, good to know that!
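To illustrate why varying column layouts make this hard, here is a minimal sketch of column-aware reading order. It assumes some PDF library has already returned words with bounding boxes; the `Word` type, the top-left coordinate origin, and the `min_gap` and `line_tol` thresholds are all hypothetical choices for illustration, not part of any tool mentioned in this thread.

```python
from dataclasses import dataclass


@dataclass
class Word:
    # Hypothetical word box; y is assumed to grow downward (top-left origin).
    x0: float
    y0: float
    x1: float
    y1: float
    text: str


def find_column_bounds(words, min_gap=20.0):
    """Merge horizontal word extents; a gap wider than min_gap starts a new column."""
    columns = []
    for start, end in sorted((w.x0, w.x1) for w in words):
        if columns and start - columns[-1][1] <= min_gap:
            columns[-1][1] = max(columns[-1][1], end)
        else:
            columns.append([start, end])
    return [(left, right) for left, right in columns]


def reading_order(words, min_gap=20.0, line_tol=3.0):
    """Return text lines column by column (left to right), top to bottom within each."""
    lines = []
    for left, right in find_column_bounds(words, min_gap):
        # Assign each word to a column by its horizontal center.
        col = [w for w in words if left <= (w.x0 + w.x1) / 2 <= right]
        col.sort(key=lambda w: (w.y0, w.x0))
        current, current_y = [], None
        for w in col:
            # A vertical jump larger than line_tol closes the current line.
            if current and abs(w.y0 - current_y) > line_tol:
                lines.append(" ".join(x.text for x in sorted(current, key=lambda x: x.x0)))
                current = []
            if not current:
                current_y = w.y0
            current.append(w)
        if current:
            lines.append(" ".join(x.text for x in sorted(current, key=lambda x: x.x0)))
    return lines
```

A single full-width heading or table row bridges the gap and merges all columns into one, which is the kind of per-file variation that makes a fully automatic solution difficult.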
Do you have any recommendations for OCR of receipts and grocery bills? I’ve dreamt of having a little app to analyse grocery spending and distribute bills among multiple people, but every time I checked, the state of receipt OCR was surprisingly too bad for this…
Last I checked, I saw a grocery bill example using https://github.com/mindee/doctr and it was fairly accurate. Bear in mind that was last year; hopefully it has gotten even better or there are other libraries.
This is a really helpful find thanks.

If there are any other libraries folks have seen out there like this, I’d love to try them out.

The PaddlePaddle project has nice models. They are not well documented and can be hard to use, so proceed at your own risk, but the project is popular.
I am using epap. It has pretty good OCR and you can export to CSV. https://apps.apple.com/de/app/epap-kassenbon-haushaltsbuch/i...
Do they have human workers in the loop for those hard-to-solve cases?
Yes, the solution I worked on had an interface for HITL (human-in-the-loop).
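For the bill-splitting idea above, the step after OCR could look roughly like this. It is a minimal sketch that assumes the OCR library returns one text line per receipt row with the price at the end; the regex, the skipped keywords, and the function names are hypothetical, and real receipts will need more tolerant parsing.

```python
import re
from decimal import Decimal

# Assumed row shape: item name, whitespace, price with two decimals ("1.20" or "1,20").
PRICE_LINE = re.compile(r"^(?P<name>.+?)\s+(?P<price>\d+[.,]\d{2})$")


def parse_items(ocr_lines):
    """Extract (name, price) pairs from OCR'd receipt lines, skipping total rows."""
    items = []
    for line in ocr_lines:
        match = PRICE_LINE.match(line.strip())
        if not match:
            continue  # headers, footers, and anything without a trailing price
        name = match.group("name").strip()
        if name.lower().startswith(("total", "subtotal", "sum")):
            continue
        price = Decimal(match.group("price").replace(",", "."))
        items.append((name, price))
    return items


def split_bill(items, assignment):
    """Split each item's price evenly among the people assigned to it.

    assignment maps an item name to the list of people sharing it;
    items without an assignment are ignored.
    """
    owed = {}
    for name, price in items:
        people = assignment.get(name, [])
        if not people:
            continue
        share = price / len(people)
        for person in people:
            owed[person] = owed.get(person, Decimal("0")) + share
    return owed
```

For example, `split_bill(parse_items(lines), {"BREAD": ["ann"]})` charges the bread to one person only. The hard part remains getting clean lines out of the OCR step in the first place.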
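A common way to wire humans into the loop is to route documents by extraction confidence. The sketch below is not the system described above; `ExtractedField`, the 0.9 threshold, and the required-field list are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ExtractedField:
    name: str
    value: str
    confidence: float  # assumed to be in [0.0, 1.0], as reported by the extraction model


def route_document(fields, required, threshold=0.9):
    """Return ("auto", []) if every required field is present and confident,
    otherwise ("review", reasons) so a human can check the document."""
    by_name = {f.name: f for f in fields}
    reasons = []
    for name in required:
        field = by_name.get(name)
        if field is None or not field.value.strip():
            reasons.append(f"missing field: {name}")
        elif field.confidence < threshold:
            reasons.append(f"low confidence on {name}: {field.confidence:.2f}")
    return ("review", reasons) if reasons else ("auto", [])
```

The review queue then only receives the hard cases, and the reasons tell the reviewer where to look.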