Hacker News
by zem 437 days ago
out of curiosity, wouldn't an automated spell check pass help catch ocr errors? e.g. "tne" would be caught immediately.
5 comments

The most confusing errors are the ones spellcheck doesn't catch, because they turn one word into another valid word. Those are exactly the ones we least want to let through.
true, it wouldn't do a 100% job, but it would be another line of defense. the reason I was wondering about it was that the gp cited an example that was easy for humans to miss, but would be caught at once with a spell checker.

there are also statistical methods to detect words that are changed into other, valid words - check out the grammar checker in google docs for instance. again, not 100%, but every bit helps.
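To make the statistical idea concrete, here is a minimal sketch, not how Google Docs actually does it. It assumes a list of known OCR confusion pairs and a table of bigram counts from some reference text. `CORPUS`, `CONFUSIONS`, and both function names are made up for illustration; a real checker would need a far larger corpus and smoothing.

```python
from collections import Counter

# Toy reference text (an assumption); a real checker would count bigrams
# over a large corpus.
CORPUS = (
    "bread and butter salt and pepper "
    "the land was arid and dry "
    "a modern house a modern city the modem was slow"
)

# Hypothetical OCR confusion pairs: a valid word mapped to the valid words
# it is commonly a misread of ("nd" read as "rid", "rn" read as "m").
CONFUSIONS = {"arid": ["and"], "modem": ["modern"]}


def bigram_counts(text):
    """Count adjacent word pairs in the reference text."""
    words = text.lower().split()
    return Counter(zip(words, words[1:]))


def context_score(words, i, candidate, counts):
    """How often `candidate` was seen next to the neighbours of position i."""
    score = 0
    if i > 0:
        score += counts[(words[i - 1], candidate)]
    if i + 1 < len(words):
        score += counts[(candidate, words[i + 1])]
    return score


def suspicious_words(sentence, counts, confusions):
    """Flag valid words whose confusable alternative fits the context better."""
    words = sentence.lower().split()
    flagged = []
    for i, word in enumerate(words):
        current = context_score(words, i, word, counts)
        for alt in confusions.get(word, []):
            if context_score(words, i, alt, counts) > current:
                flagged.append((i, word, alt))
    return flagged


COUNTS = bigram_counts(CORPUS)
print(suspicious_words("bread arid butter", COUNTS, CONFUSIONS))
# [(1, 'arid', 'and')]
print(suspicious_words("the land was arid and dry", COUNTS, CONFUSIONS))
# []
```

A plain dictionary check passes "arid" in both sentences; only the surrounding words tell them apart, which is also why this is never 100%.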

It would probably also throw out a lot of false positives which would take time to check. Especially in works of fiction, writers could take liberties with non-standard spelling.
Unless tne is an abbreviation, in which case it should pass. Names are a common place where people make up weird spellings, so spell checkers are annoying. I have terrible spelling, and yet most of the time I run spellcheck it trips on words that are spelled correctly but not in the dictionary (in large part because I run spell check after each revision, so the words I actually spelled wrong have already been fixed). "Add to dictionary" means my dictionary gets polluted with words that only apply to one document and would be wrong in the next.
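One way around the polluted-dictionary problem is to keep a word list per document and leave the shared dictionary alone. A rough sketch, assuming a simple set-based dictionary; the `misspelled` function, the tiny dictionary, and the character names are all hypothetical:

```python
import re


def misspelled(text, dictionary, document_words=frozenset()):
    """Return unknown words in first-seen order, lowercased and de-duplicated.

    `document_words` is a per-document allowlist (names, invented spellings)
    that is consulted for this text only and never merged into `dictionary`.
    """
    known = {w.lower() for w in dictionary} | {w.lower() for w in document_words}
    unknown = []
    for token in re.findall(r"[A-Za-z']+", text):
        word = token.lower()
        if word not in known and word not in unknown:
            unknown.append(word)
    return unknown


# Tiny stand-in for a real shared dictionary (an assumption).
DICTIONARY = {"the", "wizard", "met", "at", "gate"}
TEXT = "Tne wizard Zyra met Quillon at the gate"

print(misspelled(TEXT, DICTIONARY))
# ['tne', 'zyra', 'quillon']
print(misspelled(TEXT, DICTIONARY, document_words={"Zyra", "Quillon"}))
# ['tne']
```

The names stop being flagged for this book, but they would still be flagged in the next one, where they probably are typos.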
An LLM-based spellchecker would've caught it for sure. I am working on one here: https://github.com/pulkitsharma07/spelltastic.io. If anyone has suggestions on how this could help in the Project Gutenberg or Standard Ebooks workflows, please reach out to me or open an issue.

I have seen that LLMs are pretty good at understanding context-, domain-, and theme-specific terms, so they work well for spellchecking.

For future reference, this approach was tested at https://github.com/standardebooks/tools/issues/815. No errors were found in a selection of books.
Running spellcheck is a standard step on every page of proofreading. There's a "wordcheck" button in the proofing UI.
the distributed proofreaders process does include a mandatory spellcheck