
by mschuster91 2 hours ago

> I would definitely understand post processing, like extracting data, answering question .. etc, but why re-doing the OCR engine itself?

Well... the idea seems to be (as far as I understand it, at least) that optical errors and artifacts can now be compensated for, because the OCR engine is context-aware.

Say, for example, some random chemical with a long-ass name. It's not going to be in a word-correction database, but a context-aware engine (ideally, one that has been supplemented with chemistry data) can still correct "bad" reads of the chemical's name.
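
To make that concrete, here's a rough toy sketch in Python. It is not how any real OCR engine works; it only shows why the vocabulary matters. Fuzzy matching against a lexicon stands in (very crudely) for "context", and the word lists, the `correct` helper and the 0.8 cutoff are all made up for illustration.

```python
import difflib

# Hypothetical vocabularies, invented for this example.
GENERAL_WORDS = ["method", "ethyl", "acetate", "solution", "sample"]
CHEM_LEXICON = ["methylcyclohexane", "acetaminophen", "tetrahydrofuran"]

def correct(token, lexicon, cutoff=0.8):
    """Return the closest lexicon entry, or the token unchanged if nothing is close enough."""
    matches = difflib.get_close_matches(token.lower(), lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else token

# A plausible misread: the final "n" came out as "m".
bad_read = "tetrahydrofuram"

# A general-purpose word list has nothing close, so the error survives.
print(correct(bad_read, GENERAL_WORDS))

# With chemistry terms added, the misread can be repaired.
print(correct(bad_read, GENERAL_WORDS + CHEM_LEXICON))
```

Note the cutoff cuts both ways: set it too low and the corrector starts "fixing" reads that were actually right, which is the same flavor of silent substitution as the Xerox problem.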

Of course, there remains the issue of how to prevent the infamous Xerox bug [1]...

[1] https://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres...