by georgecmu 1650 days ago
If a dictionary satisfies your definition of a language model, then yes, with predictably poor results[1]. If I understand correctly, the Google Books approach[2] represented a major improvement in the accuracy of automated OCR (and this is for printed text!), but I would venture to say that implementing a language model like that would be far beyond the scope of a 'tiny project'.

[1] https://tesseract-ocr.github.io/docs/Limits_on_the_Applicati...

[2] https://tesseract-ocr.github.io/docs/Improving_Book_OCR_by_A...