by peterburkimsher 3109 days ago

Pingtype is my program for learning Chinese. I tried to use Tesseract to recognise some Chinese text from a PDF that, for some reason, wouldn't let me copy and paste. The results were awful.

I tried again with English text. I wanted a word list from a book that helps people learn English, so I took photos of the index. The format is word....page #, in two columns.

The results were just as bad.
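For what it's worth, the word....page # layout itself would be simple to turn into a word list if the recognised text were clean. Here is a minimal sketch, assuming each line of text holds one or two entries and that the dot leaders survive as runs of two or more dots. The name `parse_index_line` and the regex are made up for illustration; they are not part of Pingtype or Tesseract.

```python
import re

# Assumed entry shape: a word or short phrase, a run of two or more dots
# (the leader), then a page number. Real OCR output may need a looser pattern.
ENTRY = re.compile(r"([A-Za-z][A-Za-z' -]*?)\s*\.{2,}\s*(\d+)")

def parse_index_line(line):
    """Return (word, page) pairs found on one line of a two-column index."""
    return [(word.strip(), int(page)) for word, page in ENTRY.findall(line)]

if __name__ == "__main__":
    # Hypothetical line with one entry from each column.
    sample = "apple....12    banana.....34"
    for word, page in parse_index_line(sample):
        print(word, page)
```

Lines with no recognisable entry just give an empty list, so garbage lines can be skipped rather than crashing the run.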

I've given up on OCR, and decided I have to transcribe everything by hand. I only do it in my free time, and it's been taking months.

Is there any tool that can take a photo of a book where the pages curl towards the middle, and "flatten" it so that OCR will work better?