Using Pytesseract to Convert Images into a HTML Site


	Using Pytesseract to Convert Images into a HTML Site (armaizadenwala.com)
	73 points by armaizadenwala 2301 days ago

2 comments

markdown 2301 days ago

First of all, great work, and thank you for sharing.

The video only shows this working with an image of a text-only page. What happens when there are photos embedded in the image?


armaizadenwala 2301 days ago

Hi! Thank you!

Tesseract is trained to only recognize text from images. I haven't looked into image detection yet though.

This project fits the situation where you need to digitize a bunch of physical copies or scans of documents. Sometimes these documents contain images, such as company logos, that would be useful to include in the final HTML page.

I'll try to look into it; it is a wonderful idea for a second part. The current post is geared towards helping others transition into the world of data science with OCR by describing every step of the way.
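
For readers following along, the text-only flow discussed above could be sketched roughly as follows. This is not the article's code: the helper names `text_to_html` and `image_to_html` are hypothetical, and the default OCR path assumes `pytesseract`, Pillow, and the Tesseract binary are installed. Because `pytesseract.image_to_string` returns only recognized text, any photos or logos in the scan are dropped, which is the limitation raised in this thread.

```python
import html


def text_to_html(text, title="Scanned document"):
    """Wrap OCR output in a minimal HTML page.

    Blocks separated by a blank line become <p> elements; line breaks
    inside a block are collapsed to single spaces.
    """
    blocks = [b.strip() for b in text.split("\n\n") if b.strip()]
    paragraphs = "\n".join(
        "<p>" + html.escape(" ".join(b.split())) + "</p>" for b in blocks
    )
    return (
        "<!DOCTYPE html>\n"
        '<html>\n<head><meta charset="utf-8"><title>'
        + html.escape(title)
        + "</title></head>\n<body>\n"
        + paragraphs
        + "\n</body>\n</html>\n"
    )


def image_to_html(image_path, ocr=None):
    """Run OCR on an image file and return an HTML page of the text.

    `ocr` is a hypothetical hook: any callable mapping a path to a string.
    When omitted, pytesseract is used (assumed to be installed).
    """
    if ocr is None:
        # Imported lazily so text_to_html works without Tesseract installed.
        from PIL import Image
        import pytesseract

        def ocr(path):
            return pytesseract.image_to_string(Image.open(path))

    return text_to_html(ocr(image_path))
```

Calling `image_to_html("scan.png")` on a scanned page would then return a string that can be written to an `.html` file; the file name here is only a placeholder.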


riedel 2301 days ago

Nice. But why are you attributing Tesseract solely to Google when it was initially developed by HP? Does it help marketing nowadays?


netgusto 2301 days ago

I'd argue that one can refer to Tesseract as a Google product without being deceptive, as it's been developed by Google since 2006 [1].

[1] https://github.com/tesseract-ocr/tesseract#brief-history
