| > You are thinking of a fully automated process, not of the human verification through `diff ocr_output llm_corrected`. Fair, and I'm aware that that makes a huge difference in how worthwhile an LLM is. I'm glad you're not doing the annoyingly common "just throw AI at it" without thinking through the consequences. I'm doing two things to flag words for human review: checking the confidence score of the classifier, and checking words against a dictionary. I didn't even consider using an LLM for that since the existing process catches just about everything that's possible to catch. > I am using LSTM-based engines . . . I'm using Tesseract 5.5. It could actually be that much better, or I could just be lucky. I've got some pretty well-done scans to work with. > It's the opposite: for the numbers it could go (implicitly) "are you sure, I have a different figure for that" . . . I honestly don't know what you mean. Are you saying that an LLM would know that a reference to "page 311" should actually be a reference to "page 317" based on context? I think the example you've given makes a lot of sense if you're just using an LLM as one way to flag things for review. |
Not that of your example (the page number): that would be pretty hard to check (with current general agents. In the future, not impossible - you would need some agent finally capable to follow procedures strictly). That the extra punctuation or accent is a glitch and that the sentence has a mistake are more within the realm of a Language Model.
What I am saying is that a good Specialized Language Model (maybe a good less efficient LLM) could fix a text like:
"AB〈G〉 [Should be 'ABC'!] News was founded in 194〈S〉 [Should be '1945'!] after action from the 〈P〉CC [Should be 'FCC'!], 〈_〉 [Noise!], deman〈ci〉ing [Should be 'demanding'!] pluralist progress 〈8〉 [Should be '3'!] 〈v〉ears [should be 'years'] ear〈!〉ier [Should be 'earlier'!]..."
since it should "understand" the sentence and be already informed of the facts.