by jamescampbell 1623 days ago
I built something similar in the past using Tesseract OCR, Apache Tika, and PyPDF2 / QPDF. The idea is sound. An API based OCR already exists in Apple / Microsoft / and Google so I am not sure this would be that useful. There would also be no way for the user to trust that you are not taking the data you are OCR'ing and using it. You could apply some type of one-way encryption to the content and prove it via open source code (like Whisper Systems does for Signal), but that seems like overkill and a lot of effort for a free app.

> An API based OCR already exists in Apple / Microsoft / and Google

Where can I reach them? Thx.

https://centraluseuap.dev.cognitive.microsoft.com/docs/servi...

We have built a web client + REST API that can be used for free for small personal projects.

https://konfuzio.com/en/ocr-api/

It supports handwriting, correction of hOCR text via the web browser, and automated language detection.

We use the text to let large enterprises train document categorization and data extraction AI in a low/no-code UI.

Disclaimer: I'm one of the founders.

Thanks, I will check this out. Nice of you to explain the background as well.