Hacker News
by kargo 4010 days ago
Does it have to be open-source? If a free library that is not trainable and is restricted to Windows apps/phone is good enough, then I recommend the Microsoft OCR library. It gives you very, VERY good results out of the box. An excellent piece of work from Microsoft Research. To test it, see for example https://ocr.a9t9.com/ which uses Microsoft OCR inside.

And for comparison, here is an OCR application with Tesseract inside, which has a dramatically lower text recognition rate: http://blog.a9t9.com/p/free-ocr-windows.html

(Disclaimer: both links are my little open-source side projects)

4 comments

Are you talking about this library from Microsoft: https://www.nuget.org/packages/Microsoft.Windows.Ocr/ ?
Yes
That https://ocr.a9t9.com/ link worked pretty impressively. I took a screenshot and uploaded it to test it out. Nice.
How does the PDF OCR process compare to images? I uploaded a sample PDF with very clear sans-serif text (printed to PDF from a webpage) and there seem to be some odd substitutions: "prohibitecL" instead of "prohibited", "ac" instead of "QC" (as part of an address), random clipping of the first letter in a few lines, and random use of a capital i instead of 1.

Overall very good, I'm just wondering if the library is better with image files than PDFs?
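
One way to put a number on substitutions like these is a character error rate: the edit distance between the expected text and the OCR output, divided by the length of the expected text. The sketch below is only an illustration; the function names are hypothetical and it is not how the web app or the Microsoft OCR library scores anything.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: fewest single-character insertions,
    deletions, and substitutions needed to turn ref into hyp."""
    prev = list(range(len(hyp) + 1))  # distances from the empty prefix of ref
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into "" takes i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r
                            curr[j - 1] + 1,      # insert h
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def char_error_rate(ref: str, hyp: str) -> float:
    """Edit distance normalised by the length of the reference text."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # The two substitutions quoted above, compared case-sensitively.
    print(char_error_rate("prohibited", "prohibitecL"))  # 0.2
    print(char_error_rate("QC", "ac"))                   # 1.0
```

Running the same reference text through the image path and the PDF path and comparing the two rates would show whether the PDF conversion step is what hurts accuracy.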

Interesting... I see it now. I assume some issue during the PDF to image conversion in the web app. PDF support is just a few days old.

The OCR library itself supports only image formats as input and is "innocent" with regard to this issue ;)

I tried your app with this image: http://cdn.swapweb.com.ar/estilo-web.net/publis/dell_inspiro... and the results were impressive!

Much much better than what I can get with tesseract. Would love to have it as an API service.