
Is OCR that good yet? From what I've seen, it works well on uniform text in a single, standard-looking font. When the layout or font varies widely, or there's extraneous material on the page (headers, footers, page numbers, marginal notes, same-page footnotes, marks on low-quality source material, or scan artifacts), quality degrades. The good OCR engines I've seen can still OCR all the things and present them in a somewhat readable text format, but general intelligence (human or AGI) quickly and automatically recognizes the different kinds of text on a page, which a narrow OCR AI struggles with. A human or AGI knows that this text and that text are both blockquotes, or both marginal notes, and instinctively attaches semantic meaning to each area, font, style, and color encountered. An OCR engine struggles to get beyond blocks of text, each with its own margins and no semantic meaning attached, which leads to markup hell.
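To make the "blocks with margins but no meaning" point concrete, here is a minimal sketch of the kind of positional guessing you're left with when all you get is text blocks and their coordinates. The `Block` type, the `guess_role` function, and the 8% edge thresholds are all made up for illustration; they are not any real engine's output format or API.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical OCR output: text plus vertical extent on the page."""
    text: str
    top: float     # top edge, as a fraction of page height (0 = top of page)
    bottom: float  # bottom edge, as a fraction of page height


def guess_role(block: Block) -> str:
    """Guess a semantic role from position and content alone (assumed thresholds)."""
    near_top = block.bottom < 0.08
    near_bottom = block.top > 0.92
    if block.text.strip().isdigit() and (near_top or near_bottom):
        return "page_number"
    if near_top:
        return "header"
    if near_bottom:
        return "footer"
    # Blockquotes, marginal notes and footnotes all end up here,
    # because position alone says nothing about what they mean.
    return "body"


page = [
    Block("Chapter 3", 0.02, 0.05),
    Block("The main text of the page...", 0.10, 0.80),
    Block("1. A same-page footnote.", 0.82, 0.90),
    Block("47", 0.94, 0.97),
]
roles = [guess_role(b) for b in page]
print(roles)
```

The footnote comes out as plain "body" text, which is exactly the problem: a person sees at a glance that it's a footnote, while the heuristic only sees another block.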

To see the limitations, look at an OCR'd version of a technical book that has code samples, fonts and styles that carry different meanings, and both footnotes and endnotes. The text will be readable but disorganized, probably with inconsistent styling, and even if a good engine links some footnotes and endnotes, I suspect the linking is less than fully reliable. For reading the book, I'd rather have the scanned PDF with page images, with the OCR'd text as the text layer for searching.

Lower-quality source images seem to cause major problems for Tesseract, and even for ABBYY, judging from archive.org text conversions. Those engines confuse the more ambiguous letter and punctuation combinations, while humans can still read the same images without much trouble.
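As a rough illustration of the ambiguous-combination problem, here is a small sketch that folds a few look-alike sequences into one canonical form so that a search over OCR'd text tolerates those mix-ups. The confusion pairs below are examples I picked for illustration, not a table taken from Tesseract or ABBYY, and `fold` and `loosely_equal` are hypothetical helper names.

```python
# Illustrative look-alike pairs only; a real table would be tuned per engine and font.
MULTI_CHAR_CONFUSIONS = [
    ("rn", "m"),  # "modern" read as "modem"
    ("vv", "w"),
    ("cl", "d"),
]
SINGLE_CHAR_CONFUSIONS = [
    ("0", "o"),
    ("1", "l"),
]


def fold(text: str) -> str:
    """Collapse commonly confused glyph sequences to one canonical form."""
    # Capital I and lowercase l look alike in many fonts; fold before lowercasing.
    text = text.replace("I", "l").lower()
    # Single characters first, so that "c1" becomes "cl" before "cl" is folded to "d".
    for src, dst in SINGLE_CHAR_CONFUSIONS:
        text = text.replace(src, dst)
    for src, dst in MULTI_CHAR_CONFUSIONS:
        text = text.replace(src, dst)
    return text


def loosely_equal(a: str, b: str) -> bool:
    """True if two strings match once look-alike sequences are folded."""
    return fold(a) == fold(b)


print(loosely_equal("c1ose", "close"))   # OCR read "l" as "1"
print(loosely_equal("modern", "modem"))  # also matches: folding costs precision
```

The second call shows the trade-off: folding makes search forgiving of OCR mistakes, but it also makes genuinely different words like "modern" and "modem" indistinguishable, which a human reading the page image would never confuse.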