

	by exhibitapp 1152 days ago
	I've worked extensively in this space. For those looking for just an OCR solution, MSFT's offering, "Read", is by far the most accurate. Key-value, table, and other information extraction is a much harder problem. Anything that can go wrong in production will: documents with extra pages, rotated, blacked out, fuzzy. Many steps go into making document extraction truly e2e. The biggest enterprise users process thousands of pages a minute, which also turns document extraction into a distributed-systems scaling problem.

5 comments

idealism 1152 days ago

A few days ago, IBM announced a new OCR system[1]. Have you by chance compared it to Microsoft's offering? I'm currently looking for the best-in-class OCR solution for scanned PDF documents.

[1]: https://www.ibm.com/cloud/blog/exploring-ibms-new-optical-ch...


alsodumb 1152 days ago

Call me biased, but I've learned over time that anything that comes out of the Watson team looks good only in PR statements but sucks in production, especially at tasks like OCR. YMMV.


benjaminva 1151 days ago

We currently develop solutions in this area, and I believe that isolated OCR is not the way to go. Things are moving rapidly towards end-to-end processing of documents with huge transformer models, and I also believe that multi-modal GPT models will quickly win all use cases. If you are interested in working on this topic and are located in northern Germany, send me a message.


pvitz 1151 days ago

I would like to extract text from approximately 2000 PDF files (machine generated, not scanned) in which the layout can differ from file to file. Some have normal paragraphs, others two or even three columns. All contain tables, but I am not interested in them. Do you know a good (semi-)automatic solution for this?


exhibitapp 1151 days ago

This is a hard problem and will unfortunately require an enterprise solution. If it's only 2000 PDFs, you might be better off outsourcing to an offshore consulting agency to do it manually.


pvitz 1151 days ago

Thanks for the reply, good to know that!
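
One reason the multi-column PDF question above is hard: naive extraction reads straight across the page and interleaves the columns. Below is a minimal sketch of one common heuristic, assuming some PDF library has already produced word bounding boxes. The `Word` type, `find_columns`, `reading_order`, and the `min_gap` / `line_tol` thresholds are all hypothetical names and values chosen for illustration, not any particular tool's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """A word with its bounding box (hypothetical shape; adapt to your extractor)."""
    text: str
    x0: float   # left edge
    x1: float   # right edge
    top: float  # distance from the top of the page


def find_columns(words, min_gap=20.0):
    """Merge word x-ranges; a horizontal gap of at least min_gap starts a new column."""
    columns = []
    for x0, x1 in sorted((w.x0, w.x1) for w in words):
        if columns and x0 - columns[-1][1] < min_gap:
            columns[-1][1] = max(columns[-1][1], x1)
        else:
            columns.append([x0, x1])
    return [tuple(c) for c in columns]


def reading_order(words, min_gap=20.0, line_tol=3.0):
    """Return text lines column by column, left to right, top to bottom."""
    lines = []
    for left, right in find_columns(words, min_gap):
        col = sorted((w for w in words if left <= w.x0 and w.x1 <= right),
                     key=lambda w: w.top)
        line = []
        for w in col:
            # A jump in vertical position larger than line_tol starts a new line.
            if line and w.top - line[0].top > line_tol:
                lines.append(" ".join(x.text for x in sorted(line, key=lambda x: x.x0)))
                line = []
            line.append(w)
        if line:
            lines.append(" ".join(x.text for x in sorted(line, key=lambda x: x.x0)))
    return lines
```

This only works when an empty vertical gutter runs the full height of the page. A full-width heading or a table spanning both columns bridges the gutter and collapses everything into one column, which is part of why layouts that vary per file resist a single automatic rule.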


9dev 1152 days ago

Do you have any recommendations for OCR of receipts and grocery bills? I’ve dreamt of having a little app to analyse grocery spending and distribute bills among multiple people, but every time I checked, the state of receipt OCR was surprisingly poor for this…


puika 1152 days ago

Last I checked, I saw a grocery bill example using https://github.com/mindee/doctr and it was fairly accurate. Bear in mind that was last year; hopefully it has gotten even better, or there are other libraries.


j45 1151 days ago

This is a really helpful find thanks.

If there are any other libraries folks have seen out there like this, I’d love to try them out.


themantalope 1151 days ago

The PaddlePaddle project has nice models. They are not well documented, though, and can be hard to use, so proceed at your own risk. But it is popular.


d911 1151 days ago

I am using epap. It has pretty good OCR and you can export to CSV. https://apps.apple.com/de/app/epap-kassenbon-haushaltsbuch/i...
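
For the bill-splitting app idea at the top of this sub-thread, the step after OCR is comparatively simple. Here is a rough sketch of it, assuming the OCR engine returns plain text with one item per line and the price at the end of the line. The line format, the `SKIP` set, and the `parse_receipt` / `split_bill` helpers are assumptions for illustration; real receipts vary far more than this.

```python
import re
from decimal import Decimal, ROUND_HALF_UP

# Assumed line shape: item name, whitespace, price with two decimals ("." or ",").
LINE_RE = re.compile(r"^(?P<name>.+?)\s+(?P<price>\d+[.,]\d{2})$")
# Summary rows that look like items but are not (hypothetical list).
SKIP = {"TOTAL", "SUBTOTAL", "TAX"}


def parse_receipt(ocr_text):
    """Return (name, price) pairs for lines that look like purchased items."""
    items = []
    for raw in ocr_text.splitlines():
        match = LINE_RE.match(raw.strip())
        if match and match.group("name").upper() not in SKIP:
            price = Decimal(match.group("price").replace(",", "."))
            items.append((match.group("name"), price))
    return items


def split_bill(items, shares):
    """shares maps an item name to the list of people splitting that item evenly."""
    totals = {}
    for name, price in items:
        people = shares[name]
        part = price / len(people)
        for person in people:
            totals[person] = totals.get(person, Decimal("0")) + part
    # Round each person's total to cents; rounded shares may differ from the
    # receipt total by a cent.
    cent = Decimal("0.01")
    return {p: t.quantize(cent, rounding=ROUND_HALF_UP) for p, t in totals.items()}
```

Lines that do not match the pattern are silently dropped, so in practice you would want to surface them for manual correction rather than trust the parse.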


nextworddev 1152 days ago

Do they have human workers in the loop for those hard-to-solve cases?


exhibitapp 1151 days ago

Yes, the solution I worked on had an interface for HITL (human-in-the-loop).
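
As a generic illustration of what a human-in-the-loop step can look like (not a description of the system mentioned above), a pipeline might route low-confidence extractions to a review queue. The `ExtractedField` type, the `route` function, and the 0.9 threshold are hypothetical, and the confidence score is assumed to come from whatever extraction model is in use.

```python
from dataclasses import dataclass


@dataclass
class ExtractedField:
    name: str
    value: str
    confidence: float  # assumed to be in [0.0, 1.0], reported by the extractor


def route(fields, threshold=0.9):
    """Split fields into (auto_accepted, needs_human_review) by confidence."""
    auto, review = [], []
    for field in fields:
        if field.confidence >= threshold:
            auto.append(field)
        else:
            review.append(field)
    return auto, review


fields = [
    ExtractedField("invoice_number", "INV-1042", 0.98),
    ExtractedField("total", "1,204.50", 0.61),  # e.g. a fuzzy or rotated scan
]
auto, review = route(fields)
print([f.name for f in review])
```

The threshold trades reviewer workload against error rate, so it is usually tuned per field rather than set globally.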
