|
|
|
|
|
by cvz
489 days ago
|
|
As someone who's learning how to do OCR in order to re-OCR a bunch of poorly digitized documents, this will not work with modern OCR. Modern OCR is too good. If you're able to improve the preprocessing and recognition enough, then there's a point at which any post-processing step you do will introduce more errors than it fixes. LLM's are particularly bad as a post-processing step because the errors they introduce are _designed to be plausible_ even when they don't match the original text. This means they can't be caught just by reading the OCR results. I've only learned this recently, but it's something OCR experts have known for over a decade, including the maintainers of Tesseract. [1] OCR is already at the point where adding an LLM at the end is counterproductive. The state of the art now is to use an LSTM (also a type of neural network) which directly recognizes the text from the image. This performs shockingly well if trained properly. When it does fail, it fails in ways not easily corrected by LLM's. I've OCR'ed entire pages using Tesseract's new LSTM engine where the only errors were in numbers and abbreviations which an LLM obviously can't fix. [1] https://tesseract-ocr.github.io/docs/Limits_on_the_Applicati... |
|
I also have even recent extensive experience: I get an important amount of avoidable errors.
> at which any post-processing step you do will introduce more errors than it fixes ... the errors they [(LLMs)] introduce are _designed to be plausible_
You are thinking of a fully automated process, not of the human verification through `diff ocr_output llm_corrected`. And even then, given that I can notice errors that an algorithm with some language proficiency could certainly correct, I have reasons to suppose that a proper calibration of an LLM based system can achieve action over a large number of True Positives with a negligible amount of False Positives.
> LSTM
I am using LSTM-based engines, and on those outputs I have stated «I get an important amount of avoidable errors». The one thing that could go in your direction is that I am not using the latest version of `tesseract` (though still in the 4.x), and I have recently noticed (already through `tesseract --print-parameters | grep lstm`) that the LSTM engine evolved within 4.x, from early to later.
> numbers and abbreviations which an LLM obviously can't fix
? It's the opposite: for the numbers it could go (implicitly) "are you sure, I have a different figure for that" and for abbreviations, the LLM is exactly the thing that should guess them ot of the context. The LLM is that thing that knows that "the one defeated by Cromwell should really be Charles II-staintoberemoved instead of an apparent Charles III".