by daemonologist 701 days ago
Tesseract works great for pure label-the-characters OCR, which is sufficient for books and other sources with straightforward layouts, but it doesn't handle weird layouts (tables, columns, tables with columns in each cell, etc.). People will do absolutely depraved stuff with Word and PDF documents, and you often need semantic understanding to decipher it.

That said, sometimes no amount of understanding will improve the OCR output because a structure in a document cannot be converted to a one-dimensional string (short of using HTML/CSS or something). Maybe we'll get image -> HTML models eventually.
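
To make the one-dimensional-string problem concrete, here's a toy Python sketch. Everything in it is made up for illustration (the two tables and the helpers `flatten_reading_order` and `to_html`); it isn't anything Tesseract exposes, just a stand-in for what a layout-unaware pass does to a table.

```python
import html

def flatten_reading_order(table):
    """Join cell text row by row, left to right, the way a
    layout-unaware OCR pass might emit it (hypothetical helper)."""
    return " ".join(cell for row in table for cell in row if cell)

def to_html(table):
    """Preserve the row/column structure by emitting markup
    instead of a bare string (hypothetical helper)."""
    rows = "".join(
        "<tr>" + "".join(f"<td>{html.escape(cell)}</td>" for cell in row) + "</tr>"
        for row in table
    )
    return f"<table>{rows}</table>"

# A 2x2 table: a header row followed by one data row.
table_a = [
    ["Name", "Qty"],
    ["Bolt", "40"],
]

# A 1x4 table: the same four cells laid out in a single row.
table_b = [
    ["Name", "Qty", "Bolt", "40"],
]

flat_a = flatten_reading_order(table_a)
flat_b = flatten_reading_order(table_b)

# Both layouts collapse to the same flat string, so the string alone
# cannot tell you which table you started with.
print(flat_a == flat_b)

# The HTML versions stay distinct because the structure is kept.
print(to_html(table_a) == to_html(table_b))
```

The two tables are different documents, but once flattened they're indistinguishable, and no amount of smarter character recognition gets that information back. A structured output format is one way to keep it.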