Using Pytesseract to Convert Images into a HTML Site (armaizadenwala.com)
73 points by armaizadenwala 2301 days ago
2 comments

First of all, great work, and thank you for sharing.

The video only shows this working with an image of a text-only page. What happens when there are photos embedded in the image?

Hi! Thank you!

Tesseract is trained to recognize only text in images. I haven't looked into image detection yet, though.

This project fits the situation where you need to digitize a batch of physical copies or scans of documents. Sometimes these documents contain images, such as company logos, that would be useful to include in the final HTML page.

I'll try to look into it; it's a wonderful idea for a second part. The current post is geared toward helping others transition into data science with OCR by describing every step along the way.
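
For readers following along, here is a minimal sketch of the text-only flow discussed above. It is not the code from the linked post. It assumes pytesseract, Pillow, and the Tesseract binary are installed, and the helper names `text_to_html` and `image_to_html` are hypothetical. It also assumes paragraphs in the OCR output are separated by blank lines, which may not hold for every scan.

```python
import html


def text_to_html(text, title="Scanned document"):
    """Wrap plain OCR text in a minimal HTML page.

    Assumption: paragraphs are separated by blank lines in the OCR output.
    """
    blocks = [b.strip() for b in text.split("\n\n") if b.strip()]
    # Collapse line breaks inside a paragraph and escape HTML special characters.
    paragraphs = "\n".join(
        "<p>{}</p>".format(html.escape(" ".join(b.split()))) for b in blocks
    )
    return (
        "<!DOCTYPE html>\n<html>\n<head><title>{}</title></head>\n"
        "<body>\n{}\n</body>\n</html>\n"
    ).format(html.escape(title), paragraphs)


def image_to_html(path):
    """OCR an image file and return an HTML page (hypothetical helper)."""
    # Imported lazily so text_to_html works without the OCR dependencies.
    from PIL import Image
    import pytesseract

    text = pytesseract.image_to_string(Image.open(path))
    return text_to_html(text)
```

Because `image_to_string` returns only text, a logo or photo in the scan would not appear in the resulting page; handling those would need a separate step.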

Nice. But why are you attributing Tesseract solely to Google when it was initially developed by HP? Does it help marketing nowadays?

I'd argue that one can refer to Tesseract as a Google product without being deceptive, as it has been developed by Google since 2006 [1].

[1] https://github.com/tesseract-ocr/tesseract#brief-history