| HN Mirror



	by zem 437 days ago
	Out of curiosity, wouldn't an automated spell-check pass help catch OCR errors? E.g. "tne" would be caught immediately.
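
A minimal sketch of what such a pass could look like, assuming a plain known-word set stands in for a real dictionary. The tiny `KNOWN_WORDS` list and the function name `unknown_words` are made up for illustration and are not part of any proofreading tool mentioned in this thread.

```python
import re

# Illustrative word list only; a real pass would load a full dictionary.
KNOWN_WORDS = {"the", "cat", "sat", "on", "mat"}

def unknown_words(text, known=KNOWN_WORDS):
    """Return the words in `text` that are not in the known-word set, in order."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in known]

print(unknown_words("Tne cat sat on the mat"))  # ['tne']
```

Anything that happens to be a valid word passes through untouched, which is the limitation raised in the replies.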

5 comments

generationP 437 days ago

The most confusing errors are the ones spellcheck doesn't catch, because they turn a word into another valid word. And those are the errors we least want to slip through.

zem 437 days ago

True, it wouldn't do a 100% job, but it would be another line of defense. The reason I was wondering about it is that the GP cited an example that was easy for humans to miss but would be caught at once by a spell checker.

There are also statistical methods to detect words that have been changed into other, valid words. Check out the grammar checker in Google Docs, for instance. Again, not 100%, but every bit helps.
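
One simple flavor of such a statistical check, sketched under heavy assumptions: compare how often a word follows its left-hand neighbor in a reference corpus against how often a confusable alternative does. The `CONFUSIONS` table, the `ratio` threshold, and the function names are hypothetical, and this is not a description of how Google Docs works.

```python
from collections import Counter

# Hypothetical OCR confusion pairs: valid word -> words it may have replaced.
CONFUSIONS = {"arid": ["and"], "modem": ["modern"]}

def bigram_counts(corpus_text):
    """Count adjacent word pairs in a whitespace-tokenized reference corpus."""
    words = corpus_text.lower().split()
    return Counter(zip(words, words[1:]))

def suspicious_words(text, counts, ratio=5):
    """Flag (index, word, suggestion) where an alternative fits the previous word far better."""
    words = text.lower().split()
    flagged = []
    for i in range(1, len(words)):
        prev, word = words[i - 1], words[i]
        for alt in CONFUSIONS.get(word, []):
            seen = counts[(prev, word)]
            alt_seen = counts[(prev, alt)]
            # The +1 keeps unseen pairs from being flagged on very thin evidence.
            if alt_seen >= ratio * (seen + 1):
                flagged.append((i, word, alt))
    return flagged

counts = bigram_counts("salt and pepper " * 10 + "the arid desert")
print(suspicious_words("salt arid pepper", counts))  # [(1, 'arid', 'and')]
print(suspicious_words("the arid desert", counts))   # []
```

The output is only a list of candidates for a human to review, in keeping with the "not 100%" caveat.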

Wurdan 437 days ago

It would probably also produce a lot of false positives, which would take time to check, especially in works of fiction, where writers often take liberties with non-standard spelling.

bluGill 436 days ago

Unless "tne" is an abbreviation, in which case it should pass. Names are a common place where people make up weird spellings, which makes spell checkers annoying. I have terrible spelling, and yet most of the time I run spellcheck it trips on words that are spelled correctly but are not in the dictionary (in large part because I run spellcheck after each revision, so the words I actually spelled wrong have already been fixed). Using "Add to dictionary" means my dictionary gets polluted with words that only apply to one document and would be wrong in the next.
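
One way around the pollution problem, sketched as an assumption rather than a feature of any particular spell checker: keep accepted oddities in a per-document list that is merged with the shared dictionary only while checking that document. All names here (`flag_words`, `shared`, `book_a_words`) are hypothetical.

```python
def flag_words(words, shared_dict, document_words=frozenset()):
    """Return words found in neither the shared dictionary nor this document's own list."""
    allowed = set(shared_dict) | set(document_words)
    return [w for w in words if w.lower() not in allowed]

shared = {"the", "ship", "sailed", "to"}
# Hypothetical: an abbreviation and an invented name accepted for one book only.
book_a_words = {"tne", "quorl"}

text = ["The", "ship", "Quorl", "sailed", "to", "tne"]
print(flag_words(text, shared, book_a_words))  # []
print(flag_words(text, shared))                # ['Quorl', 'tne']
```

With this split, "tne" stays accepted in the one book where it is an abbreviation and is still flagged everywhere else.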

pulkitsh1234 437 days ago

An LLM-based spellchecker would've caught it for sure. I am working on one here: https://github.com/pulkitsharma07/spelltastic.io. If anyone has suggestions on how this could help in Project Gutenberg's or Standard Ebooks' workflows, please reach out to me or open an issue.

I have seen that LLMs are pretty good at understanding context-, domain-, and theme-specific terms, so their spellchecking is pretty good.

robin_reala 435 days ago

For future reference, this approach was tested at https://github.com/standardebooks/tools/issues/815. No errors were found in a selection of books.

fernly 436 days ago

Running spellcheck is a standard step on every page of proofreading. There's a "wordcheck" button in the proofing UI.

contact9879 437 days ago

The Distributed Proofreaders process does include a mandatory spellcheck.
