by markdown 2301 days ago
First of all, great work, and thank you for sharing.

The video only shows this working with an image of a text-only page. What happens when there are photos embedded in the image?

1 comment

Hi! Thank you!

Tesseract is trained to recognize only text in images. I haven't looked into image detection yet, though.

This project fits the situation where you need to digitize a bunch of physical copies or scans of documents. Sometimes these documents contain images, like company logos, which would be useful to include in the final HTML page.
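One possible starting point, as a rough sketch and not something from the post: when Tesseract runs over a logo or photo it often emits low-confidence garbage "words", so those boxes can be flagged as candidate non-text regions. The function name `split_ocr_boxes`, the threshold of 60, and the sample data are all made up for illustration. The input is assumed to have the dict shape that `pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)` returns.

```python
def split_ocr_boxes(data, min_conf=60.0):
    """Split Tesseract-style word data into confident words and suspect boxes.

    `data` is assumed to be a dict of parallel lists with the keys
    "text", "conf", "left", "top", "width", "height".
    `min_conf` is an arbitrary illustrative threshold, not a recommended value.
    """
    text_boxes, suspect_boxes = [], []
    for i, raw_text in enumerate(data["text"]):
        # conf may arrive as a string or a number depending on the version.
        conf = float(data["conf"][i])
        # Skip layout rows (reported with conf -1) and empty entries.
        if conf < 0 or not raw_text.strip():
            continue
        box = (data["left"][i], data["top"][i], data["width"][i], data["height"][i])
        if conf >= min_conf:
            text_boxes.append((raw_text, box))
        else:
            # Low-confidence "text" may come from a logo or photo (heuristic only).
            suspect_boxes.append(box)
    return text_boxes, suspect_boxes


# Hand-written sample in the same shape, so this runs without Tesseract installed.
sample = {
    "text": ["", "Invoice", "~#|", "Total"],
    "conf": ["-1", "96", "12", "91"],
    "left": [0, 40, 400, 40],
    "top": [0, 30, 20, 80],
    "width": [600, 120, 150, 90],
    "height": [800, 24, 150, 24],
}
words, suspects = split_ocr_boxes(sample)
print(words)
print(suspects)
```

The suspect boxes could then be cropped from the scan and saved as image files for the HTML page. This is only a heuristic: noisy or skewed text will also land in the suspect list, so real image detection would need something more robust.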

I'll try to look into it; it's a wonderful idea for a second part. The current post is geared towards helping others transition into the world of data science with OCR by describing every step of the way.